python

316 Topics

Azure Functions MCP extension now supports MCP Prompts
We are thrilled to announce that the MCP prompt trigger is now available in public preview in the Azure Functions MCP extension! With this release, the extension supports all three core MCP server primitives (tools, resources, and prompts), giving you a complete platform for building rich MCP servers on Azure Functions. In case you missed it, the MCP resource trigger is generally available for serving resources and building interactive UIs in MCP Apps.

What are MCP Prompts

In the Model Context Protocol (MCP), prompts are reusable templates that let server authors provide parameterized prompts for a domain, or showcase how best to use the MCP server. Prompts are user-controlled: they require explicit invocation rather than automatic triggering. They can also be context-aware, referencing available resources and tools to create comprehensive workflows.

Unlike tools (which are model-controlled) and resources (which are application-controlled), prompts are exposed from servers to clients so users can explicitly select them. Applications typically expose prompts through slash commands, command palettes, dedicated UI buttons, or context menus.

How It Works

In Python, defining a prompt is as simple as decorating a function. Here's a prompt that returns a code review checklist:

    import logging
    import azure.functions as func

    app = func.FunctionApp()  # assumed app setup; not shown in the original snippet

    @app.mcp_prompt_trigger(
        arg_name="context",
        prompt_name="code_review_checklist",
        description="Returns a structured code review checklist prompt for evaluating code changes."
    )
    def code_review_checklist(context: func.PromptInvocationContext) -> str:
        logging.info("Code review checklist prompt invoked.")
        return """You are a senior software engineer performing a code review.
    Use the following checklist to evaluate the code:

    1. **Correctness**: Does the code do what it's supposed to?
    2. **Error Handling**: Are edge cases and failures handled?
    3. **Security**: Are there any vulnerabilities (injection, auth, secrets)?
    4. **Performance**: Are there obvious inefficiencies?
    5. **Readability**: Is the code clear and well-named?
    6. **Tests**: Are there adequate tests for the changes?

    Provide your feedback in a structured format with a severity level (critical, warning, suggestion) for each finding."""

Prompts can accept arguments, allowing clients to customize the generated message. Here's a prompt that generates documentation with configurable parameters:

    @app.mcp_prompt_trigger(
        arg_name="context",
        prompt_name="generate_documentation",
        prompt_arguments=[
            func.PromptArgument("function_name", "The name of the function to document.", required=False),
            func.PromptArgument("style", "Documentation style: 'concise', 'detailed', or 'tutorial'.", required=False)
        ],
        description="Generates API documentation for a function. Arguments are configured in Program.cs."
    )
    def generate_documentation(context: func.PromptInvocationContext) -> str:
        # Fall back to defaults when the client omits an optional argument
        function_name = context.arguments.get("function_name", "(unknown)")
        style = context.arguments.get("style", "concise")
        logging.info(f"Generate docs prompt invoked for function: {function_name}")
        return f"""Generate API documentation for the function named **{function_name}**.

    Documentation style: **{style}**

    Include the following sections:
    - **Description**: What the function does.
    - **Parameters**: List each parameter with its type and purpose.
    - **Return Value**: What the function returns.
    - **Example Usage**: A short code example showing how to call it."""

Check out the Get Started section for the complete sample and samples in different languages.

Why Azure Functions

Azure Functions is the ideal platform for hosting remote MCP servers because of its built-in MCP authentication, event-driven scaling from 0 to N, and serverless billing. Together these keep your agentic tools secure, cost-effective, and able to scale with load. With the MCP extension, you focus on implementing the primitives you want to expose (tools, resources, and prompts) instead of worrying about MCP protocol details and server logistics.
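For a sense of the protocol details the extension takes care of, the sketch below builds the JSON-RPC messages a client and server exchange for a `prompts/get` call, following the message shapes described in the MCP specification. The helper names (`build_prompts_get_request`, `build_prompts_get_result`) and the sample argument values are illustrative assumptions; they are not part of the Azure Functions MCP extension.

```python
import json
from typing import Any


def build_prompts_get_request(request_id: int, name: str, arguments: dict[str, str]) -> dict[str, Any]:
    """Client side: JSON-RPC request asking the server to render a named prompt."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "prompts/get",
        "params": {"name": name, "arguments": arguments},
    }


def build_prompts_get_result(request_id: int, description: str, text: str) -> dict[str, Any]:
    """Server side: JSON-RPC response wrapping the rendered prompt as a user message."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "result": {
            "description": description,
            "messages": [{"role": "user", "content": {"type": "text", "text": text}}],
        },
    }


# Hypothetical argument values, for illustration only
request = build_prompts_get_request(
    1, "generate_documentation", {"function_name": "parse_config", "style": "detailed"}
)
response = build_prompts_get_result(
    request["id"],
    "Generates API documentation for a function.",
    "Generate API documentation for the function named **parse_config**.",
)
print(json.dumps(request, indent=2))
print(json.dumps(response, indent=2))
```

With the prompt trigger, the function body only has to produce the prompt text; the envelope above is what the extension is expected to build and parse on your behalf.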
Get Started

You can start building today using our quickstarts and samples:
- Python
- TypeScript
- .NET
- Java

Documentation:
- Azure Functions MCP extension overview
- Prompt trigger

We'd Love to Hear from You!

Let us know your thoughts about the new prompt trigger. What kinds of prompts are you building for your MCP servers? What would you like us to prioritize next? Share your feedback in our GitHub repo.
lily-ma
May 29, 2026, Apps on Azure Blog
Building Agentic Systems on Azure: Microsoft Foundry Agents SDK vs Microsoft Agent Framework
In my recent experience as a Senior Consultant at Microsoft, I've been actively involved in designing and delivering AI-driven solutions, with a strong focus on building intelligent agents using modern frameworks. Along the way, I've built agents using both Microsoft Foundry Agents SDK (hereafter "Agents SDK") and Microsoft Agent Framework (MAF).

Both approaches are powerful and capable. However, once you move beyond simple proofs of concept, the developer experience and architectural patterns start to differ significantly. This article provides a practical comparison based on real implementation experience and aims to help developers choose the right approach.

Approach 1: Agents SDK

Agents SDK provides a straightforward way to create agents with integrated tools and models.

Example: Creating an Agent

    import os

    from azure.ai.projects import AIProjectClient
    from azure.ai.agents.models import AzureAISearchTool, AzureAISearchQueryType
    from azure.identity import DefaultAzureCredential

    client = AIProjectClient(
        credential=DefaultAzureCredential(),
        endpoint=os.getenv("AZURE_AI_PROJECT_ENDPOINT"),
    )

    # Configure tools
    # conn_id is the ID of the project's Azure AI Search connection (not defined in the original snippet)
    ai_search = AzureAISearchTool(
        index_connection_id=conn_id,
        index_name="my-index",
        query_type=AzureAISearchQueryType.SEMANTIC,
    )

    # Create agent (persisted in Foundry portal)
    agent = client.agents.create_agent(
        model=os.getenv("AZURE_AI_AGENT_DEPLOYMENT_NAME"),
        name="MyAgent",
        instructions="You are a helpful assistant.",
        tool_resources=ai_search.resources,
        tools=ai_search.definitions,
    )

    # Run conversation
    thread = client.agents.threads.create()
    client.agents.messages.create(thread_id=thread.id, role="user", content="Hello")
    run = client.agents.runs.create(thread_id=thread.id, agent_id=agent.id)

What this approach provides:
- Native integration with Azure AI services (OpenAI, AI Search, MCP)
- Managed execution environment
- Simple and quick agent setup

Conceptually, this approach can be summarized as: Model + Tools + Execution

Strengths
✅ Rapid development and onboarding
✅ Strong integration within the Azure ecosystem
✅ Well-suited for single-agent or tool-driven use cases
✅ Minimal infrastructure overhead

Challenges observed in practice

As the complexity of scenarios increases, certain limitations become more visible:
- Multi-agent workflows require custom orchestration logic
- Agent handoffs must be implemented manually
- Context sharing across agents requires additional design effort

While this approach offers flexibility, it shifts orchestration complexity to the developer.

Approach 2: Microsoft Agent Framework (MAF)

Microsoft Agent Framework introduces a higher-level abstraction, focused on agent orchestration and system design.

Creating an Agent

    import asyncio
    import os

    from agent_framework import Agent, WorkflowBuilder, Message
    from agent_framework.foundry import FoundryChatClient
    from azure.identity import DefaultAzureCredential

    client = FoundryChatClient(
        project_endpoint=os.getenv("FOUNDRY_PROJECT_ENDPOINT"),
        model=os.getenv("FOUNDRY_MODEL_DEPLOYMENT_NAME"),
        credential=DefaultAzureCredential(),
    )

    # Create agents (in-process only, not persisted in portal)
    researcher = Agent(client, name="ResearcherAgent", instructions="Research topics thoroughly.")
    writer = Agent(client, name="WriterAgent", instructions="Write concise summaries.")

    # Build and run multi-agent workflow
    workflow = WorkflowBuilder(start_executor=researcher).add_edge(researcher, writer).build()

    async def main() -> None:
        # Stream events as the researcher hands off to the writer
        async for event in workflow.run(Message("user", "Summarize migration best practices"), stream=True):
            print(event.content)

    asyncio.run(main())

What this approach provides:
- Built-in orchestration capabilities
- Native support for multi-agent workflows
- Structured agent lifecycle management
- Context and memory handling

Conceptually, this can be viewed as: Agents + Orchestration + System Design

Observations from implementation

When implementing similar use cases using MAF:
- Agent responsibilities became clearly defined
- Routing and delegation patterns were significantly simplified
- Overall system architecture became easier to maintain and scale

This approach encourages thinking in terms of agent ecosystems rather than isolated agents.

Architecture Comparison

[Figures: architecture diagrams for Agents SDK and Microsoft Agent Framework (MAF)]

Choosing the Right Approach

Use Agents SDK when:
- You need rapid development for a single-agent use case
- The workflow is relatively straightforward
- You prefer flexibility and lower-level control

Use Microsoft Agent Framework when:
- You are designing multi-agent systems
- Your solution requires routing, delegation, or handoffs
- Long-term scalability and maintainability are essential

Pros and Cons Summary

Agents SDK

Pros:
- Easy to get started
- Strong Azure integration
- Flexible design

Cons:
- Manual orchestration required
- Limited native multi-agent support
- Complexity increases as scenarios grow

Microsoft Agent Framework (MAF)

Pros:
- Built-in orchestration
- Native multi-agent support
- Scalable and structured architecture

Cons:
- Learning curve for new developers
- More opinionated framework design
- Reduced low-level control compared to SDK-based approach

References and Repositories

🔗 Microsoft Agent Framework (MAF)
- Microsoft Agent Framework - GitHub Repository
- Microsoft Agent Framework Samples - Tutorials & Examples
- Workflow Samples (Multi-agent patterns)
- FoundryChatClient sample (Python)
- Agent Framework demos - GitHub Source

📘 Documentation
- Microsoft Agent Framework Overview (Microsoft Learn)
- Agent Framework + Microsoft Foundry provider docs

🔗 Azure AI Projects / Agents SDK
- Azure AI Projects SDK - Python (GitHub Source)
- Azure AI Projects Agents (.NET SDK repo)

📘 Documentation
- Azure AI Projects SDK (Python) - Microsoft Learn
- Azure AI Agents SDK - Microsoft Learn

Conclusion

Azure AI Projects and Microsoft Agent Framework both play important roles in the modern agent development landscape.
- Agents SDK enables quick and flexible agent development
- Microsoft Agent Framework enables structured, scalable agent systems

In practice, the choice depends on whether you are building a single agent feature or a multi-agent system.
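To make the "manual orchestration" trade-off concrete, here is a minimal, framework-free sketch of the kind of hand-written handoff logic that the SDK-only approach leaves to the developer. The `AgentFn` alias, the stub agents, and `run_pipeline` are all hypothetical stand-ins; no Azure SDK or Agent Framework calls are made.

```python
from typing import Callable

# Hypothetical stand-in: an "agent" is any function from input text to output text.
AgentFn = Callable[[str], str]


def researcher(task: str) -> str:
    """Stub agent: a real one would call a model and tools."""
    return f"notes on: {task}"


def writer(notes: str) -> str:
    """Stub agent: turns the previous agent's output into a summary."""
    return f"summary of ({notes})"


def run_pipeline(agents: list[AgentFn], user_input: str) -> list[str]:
    """Hand each agent's output to the next one and keep a shared transcript."""
    transcript = [user_input]
    for agent in agents:
        transcript.append(agent(transcript[-1]))
    return transcript


transcript = run_pipeline([researcher, writer], "migration best practices")
print(transcript[-1])
```

Even in this toy form, the ordering, the context passed between agents, and the transcript are all the caller's responsibility, which is the work a workflow builder is meant to absorb.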
Final Thought

Agents SDK helps you get started quickly. Microsoft Agent Framework helps you scale with confidence.

In a follow-up blog, I'll dive into how the M365 Agents SDK compares with Microsoft Agent Framework, especially in the context of enterprise productivity and Copilot experiences.
ChaitanyaThalloory
May 28, 2026, Microsoft Developer Community Blog
Azure AI Model Inference API
The Azure AI Model Inference API provides a unified interface for developers to interact with various foundational models deployed in Azure AI Studio. This API allows developers to generate predictions from multiple models without changing their underlying code. By providing a consistent set of capabilities, the API simplifies the process of integrating and switching between different models, enabling seamless model selection based on task requirements.
Sharda_Kaur
May 25, 2026, Educator Developer Blog
Building an End-to-End Azure RAG Strategy Agent with MS Foundry
High-Level Architecture

This architecture represents an end-to-end Retrieval-Augmented Generation (RAG) pipeline where raw documents are ingested from Azure Blob Storage, processed using Document Intelligence, transformed into embeddings via Azure OpenAI, and indexed in Azure AI Search for hybrid retrieval. A Foundry/MAF-based agent orchestrates query processing by combining user input with relevant search results and generates contextual responses, which are exposed through a FastAPI or CLI interface.

This solution is composed of two main layers:

1. Data Ingestion Layer (RAG Pipeline)

This layer transforms raw enterprise documents into searchable knowledge.

Flow:
- Raw documents stored in Azure Blob Storage
  - Supported formats: PDF, DOCX, PPTX, images, etc.
- Document Intelligence extraction
  - Extracts: text, tables, key-value pairs, structure
  - Writes output as structured JSON back to Blob (processed/)
- Chunking + Embedding
  - Documents are split into chunks
  - Each chunk is embedded using Azure OpenAI (text-embedding-*)
- Indexing into Azure AI Search
  - Creates a hybrid index: keyword search, semantic ranking, vector search
  - Enables flexible retrieval strategies

2. Query Layer (Strategy Agents)

This layer enables intelligent query answering.

Flow:
- User sends a query via:
  - FastAPI endpoint
  - CLI interface
- Query is handled by:
  - Microsoft Agent Framework (MAF) agent
  - Running on Azure AI Foundry
- Agent:
  - Queries Azure AI Search
  - Retrieves top relevant chunks
  - Injects them into LLM prompt
- LLM generates grounded response

This follows the standard RAG pattern: Retrieval → Augmentation → Generation

End-to-End Flow

[Figure: end-to-end flow diagram]

Key Azure Services Used

Service | Purpose
--- | ---
Azure Blob Storage | Raw + processed document storage
Azure AI Document Intelligence | Extract structured content
Azure OpenAI | Embeddings + LLM generation
Azure AI Search | Hybrid retrieval engine
Azure AI Foundry | Agent orchestration
Microsoft Agent Framework | Agent execution layer

Why this Architecture Matters

This solution goes beyond basic RAG and provides:
- Hybrid Retrieval
  - Combines keyword + semantic + vector search
  - Improves recall and accuracy
- Structured Document Parsing
  - Handles complex enterprise documents
  - Extracts tables and metadata
- Agent-Based Orchestration
  - Enables reasoning over retrieval results
  - Extensible for multi-agent workflows
- Scalable Data Pipeline
  - Supports continuous ingestion
  - Works with large document collections

Enterprise Considerations
- Use Managed Identity for secure service access
- Apply RBAC on Cosmos DB / Search / Storage
- Enable Private Endpoints for network isolation
- Use Guardrails + Evaluations in Foundry

Summary

This repository demonstrates a production-ready Azure RAG architecture:
- Ingest → Extract → Chunk → Embed → Index
- Retrieve → Reason → Generate
- Powered by Azure AI Foundry + Agent Framework

By combining data engineering + AI orchestration, it enables enterprise AI systems that are:
- Accurate
- Grounded
- Extensible

Repo: https://github.com/snd94/azure-rag-strategy-agent

Please refer to the Microsoft Learn Documentation for further information:
- Azure AI Search documentation - Azure AI Search | Microsoft Learn
- Document Intelligence documentation - Quickstarts, Tutorials, API Reference - Foundry Tools | Microsoft Learn
- How to generate embeddings with Azure OpenAI in Microsoft Foundry Models - Microsoft Foundry | Microsoft Learn
- Microsoft Agent Framework Overview | Microsoft Learn
- What is Microsoft Foundry? - Microsoft Foundry | Microsoft Learn
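As a rough illustration of the chunking step in the ingestion layer described above, the sketch below splits text into fixed-size, overlapping character windows. This is an assumption for illustration: the article does not say which chunking strategy the repository uses, and real pipelines often split on tokens or document structure instead. The function name `chunk_text` and its default sizes are hypothetical.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into character windows of chunk_size that overlap by `overlap` characters."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("chunk_size must be positive and overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks: list[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


sample_chunks = chunk_text("abcdefghij", chunk_size=4, overlap=2)
print(sample_chunks)
```

The overlap means a sentence that straddles a boundary still appears intact in at least one chunk, at the cost of embedding some text twice.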
SHAILESHDEVADIGA
May 25, 2026, Microsoft Developer Community Blog
Building AI Agents with Microsoft Foundry: A Progressive Lab from Hello World to Self-Hosted
AI agent development has a steep on-ramp. The combination of new SDKs, tool-calling patterns, model selection decisions, retrieval-augmented generation, and deployment concerns means most developers spend more time wiring things together than actually building anything useful. The Microsoft Foundry Agent Lab is a structured, open-source demo series designed to change that: nine self-contained demos, each adding exactly one new concept, all built on the same Microsoft Foundry SDK and a single model deployment.

This post walks through what the lab contains, how each demo works under the hood, and the architectural decisions that make it a useful reference for AI engineers building production agents.

Why a Progressive Lab?

Agent frameworks can be overwhelming. A developer who opens a rich example with RAG, tool-calling, streaming, and a custom UI all at once has no clear line of sight to which parts are essential and which are embellishments. The Foundry Agent Lab takes the opposite approach: start with the absolute minimum and introduce one new primitive per demo. By the time you reach Demo 8, you have seen every major capability, not in one monolithic sample, but in a layered sequence where each addition is visible and understandable.
# | Demo | New Concept | Tool Used | UX
--- | --- | --- | --- | ---
0 | hello-demo | Agent creation, Responses API, conversations | None | Terminal
1 | tools-demo | Function calling, tool-calling loop, live API | FunctionTool | Terminal
2 | desktop-demo | UI decoupling: same agent, different surface | None | Desktop (Tkinter)
3 | websearch-demo | Server-side built-in tools, no client loop | WebSearchTool | Terminal
4 | code-demo | Code execution in sandbox, Gradio web UI | CodeInterpreterTool | Web (Gradio)
5 | rag-demo | Document upload, vector stores, RAG grounding | FileSearchTool | Terminal
6 | mcp-demo | MCP servers, human-in-the-loop approval | MCPTool | Terminal
7 | toolbox-demo | Centralized tool governance, Toolbox versioning | Toolbox | Terminal
8 | hosted-demo | Self-hosted agent with Responses protocol | Custom server | Terminal + Agent Inspector

The Model Router: One Deployment to Rule Them All

Before diving into the demos, it is worth understanding the one architectural decision that ties the entire lab together: every agent uses model-router as its model deployment.

    MODEL_DEPLOYMENT=model-router

Model Router is a Microsoft Foundry capability that inspects each request at inference time and routes it to the optimal available model, weighing task complexity, cost, and latency. A simple factual question goes to a fast, cheap model. A complex tool-calling chain with code generation gets routed to a frontier model. You write zero routing logic.

The lab's MODEL-ROUTER.md file contains empirical observations from running all nine demos. A sample of what the router selected:

Demo | Query | Task Type | Model Selected
--- | --- | --- | ---
hello | "What's the capital of WA state?" | Factual recall | grok-4-1-fast-reasoning
hello | "Summarize our conversation" | Summarization | gpt-5.2-chat-2025-12-11
tools | "What's the weather in Seattle?" | Tool-using | gpt-5.4-mini-2026-03-17
code | Data analysis with code generation | Code generation + execution | gpt-5.4-2026-03-05
rag | HR policy document question | Retrieval + synthesis | gpt-5.3-chat-2026-03-03

This is the strongest signal in the lab: you do not need to reason about model selection. You declare what your agent needs to do; the router handles the rest, and in the sampled runs above it picked a model suited to each task.

Demo 0: The Minimum Viable Agent

The hello-demo establishes the baseline pattern used by every subsequent demo. Two files: one to register the agent, one to chat with it.

Registering the agent

    from azure.identity import DefaultAzureCredential
    from azure.ai.projects import AIProjectClient
    from azure.ai.projects.models import PromptAgentDefinition

    # PROJECT_ENDPOINT, AGENT_NAME and MODEL_DEPLOYMENT are configuration values
    # (for example, loaded from the demo's .env file)
    credential = DefaultAzureCredential()
    project = AIProjectClient(endpoint=PROJECT_ENDPOINT, credential=credential)

    agent = project.agents.create_version(
        agent_name=AGENT_NAME,
        definition=PromptAgentDefinition(
            model=MODEL_DEPLOYMENT,
            instructions="You are a helpful, friendly assistant.",
        ),
    )

Authentication uses DefaultAzureCredential, which works with az login locally and with managed identity in production. There are no API keys anywhere in the code.

Chatting with the agent

    # Assumption: `openai` is the OpenAI-compatible client obtained from the project client
    openai = project.get_openai_client()

    # Create a server-side conversation (persists history across turns)
    conversation = openai.conversations.create()

    # Each turn sends the user message; the agent sees full history
    response = openai.responses.create(
        input=user_input,
        conversation=conversation.id,
        extra_body={"agent_reference": {"name": AGENT_NAME, "type": "agent_reference"}},
    )
    print(response.output_text)

The conversation object is server-side. You pass its ID on every turn; the history lives in Foundry, not in a local list. This is the Responses API pattern, distinct from the older Completions or Chat Completions APIs.

Demo 1: Function Tools and the Tool-Calling Loop

Demo 1 adds function calling against a real weather API.
The key insight here is that the model does not execute the function. It requests the execution, and your code executes it locally, then feeds the result back.

Declaring a function tool

    from azure.ai.projects.models import FunctionTool, PromptAgentDefinition

    func_tool = FunctionTool(
        name="get_weather",
        description="Get the current weather for a given city.",
        parameters={
            "type": "object",
            "properties": {"city": {"type": "string", "description": "City name"}},
            "required": ["city"],
        },
        strict=True,
    )

    agent = project.agents.create_version(
        agent_name=AGENT_NAME,
        definition=PromptAgentDefinition(
            model=MODEL_DEPLOYMENT,
            tools=[func_tool],
            instructions="You are a weather assistant...",
        ),
    )

The tool-calling loop

    import json

    response = openai.responses.create(input=user_input, conversation=conversation.id, ...)

    # Loop while the model is requesting tool calls
    while any(item.type == "function_call" for item in response.output):
        input_list = []
        for item in response.output:
            if item.type == "function_call":
                args = json.loads(item.arguments)
                result = get_weather(args["city"])  # execute locally
                input_list.append(FunctionCallOutput(call_id=item.call_id, output=result))

        # Send results back to the agent
        response = openai.responses.create(input=input_list, conversation=conversation.id, ...)

    print(response.output_text)

The strict=True parameter on FunctionTool enforces structured outputs: the model must return arguments that match the declared JSON schema exactly. This eliminates argument parsing errors in production.

Demo 2: UI Is Not Your Agent

Demo 2 runs the exact same agent as Demo 1 but surfaces it in a Tkinter desktop window. The point is pedagogical: your agent definition, conversation management, and tool-calling logic are entirely independent of your UI layer. Swapping from terminal to desktop requires changing only the presentation code; nothing in the agent or conversation path changes.
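One way to picture this separation, as a sketch only: the agent call lives in a single function, and each UI is a thin wrapper around it. `ask_agent` below is a hypothetical stub that echoes its input rather than calling Foundry, and the two front-end functions are illustrative; none of this is the lab's code.

```python
from typing import Callable


def ask_agent(user_input: str) -> str:
    """Hypothetical stand-in for the agent call (no network access)."""
    return f"agent reply to: {user_input}"


def terminal_frontend(ask: Callable[[str], str], lines: list[str]) -> list[str]:
    """Terminal-style surface: plain prefixed text."""
    return [f"> {ask(line)}" for line in lines]


def widget_frontend(ask: Callable[[str], str], lines: list[str]) -> list[dict[str, str]]:
    """GUI-style surface: structured rows that a widget could render."""
    return [{"user": line, "agent": ask(line)} for line in lines]


# Both surfaces reuse the same agent function unchanged
print(terminal_frontend(ask_agent, ["hello"]))
print(widget_frontend(ask_agent, ["hello"]))
```

Because both front ends receive the agent call as a parameter, replacing the stub with a real client touches neither of them.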
This is a principle worth internalising early: agent logic and UI logic should never be entangled. The lab enforces this separation structurally.

Demo 3: Server-Side Built-In Tools

The web search demo introduces a sharp contrast with Demo 1. With WebSearchTool, the tool-calling loop disappears entirely from client code:

    from azure.ai.projects.models import WebSearchTool

    agent = project.agents.create_version(
        agent_name="Search-Agent",
        definition=PromptAgentDefinition(
            model=MODEL_DEPLOYMENT,
            tools=[WebSearchTool()],
            instructions="You are a research assistant...",
        ),
    )

The agent decides when to search, executes the search server-side, and returns a grounded response with citations. Your client code looks identical to Demo 0: a simple responses.create() call with no tool loop.

The distinction matters architecturally:
- Function tools (Demo 1): tool execution happens on your client; you control the code, the API call, the error handling.
- Built-in tools (Demo 3+): tool execution happens inside Foundry; you get results without managing execution.

Demo 4: Code Interpreter and the Gradio Web UI

Demo 4 attaches CodeInterpreterTool, which gives the agent a sandboxed Python execution environment inside Foundry. The agent can write code, run it, observe output, and iterate, all server-side. Combined with a Gradio web interface, this demo shows an agent that can perform data analysis, generate charts, and explain results through a browser UI.

Model Router is particularly interesting here: the empirical data shows it selects a more capable frontier model (gpt-5.4-2026-03-05) for code-generation tasks, while simpler conversational turns stay on lighter models.

Demo 5: Retrieval-Augmented Generation with FileSearchTool

Demo 5 introduces RAG.
The setup phase uploads a document, creates a vector store, and attaches it to the agent:

    from azure.ai.projects.models import FileSearchTool, PromptAgentDefinition

    # Upload document and create a vector store
    vector_store = openai.vector_stores.create(name="employee-handbook-store")
    with open("data/employee-handbook.md", "rb") as f:
        openai.vector_stores.files.upload_and_poll(
            vector_store_id=vector_store.id, file=f
        )

    # Attach the vector store to the agent
    agent = project.agents.create_version(
        agent_name="RAG-Agent",
        definition=PromptAgentDefinition(
            model=MODEL_DEPLOYMENT,
            tools=[FileSearchTool(vector_store_ids=[vector_store.id])],
            instructions="Answer questions using only the provided documents...",
        ),
    )

At query time, the agent embeds the question, searches the vector store semantically, retrieves matching chunks, and generates an answer grounded in the retrieved content, entirely server-side. The client code remains a plain responses.create() call.

An important detail: the .vector_store_id file is written to disk during setup and read back during the chat session, so the demo survives process restarts without re-uploading the document. The .gitignore excludes this file from source control.

Demo 6: Model Context Protocol

Demo 6 connects the agent to a GitHub MCP server, giving it access to repository and issue data via the open Model Context Protocol standard. MCP servers expose tools over a standardised wire protocol; the agent discovers and calls them without any client-side function declarations.

The demo also demonstrates human-in-the-loop approval: before executing any MCP tool call, the agent surfaces the proposed action and waits for the user to confirm. This is an important safety pattern for agents that can trigger side effects on external systems.

Demo 7: Toolbox - Centralised Tool Governance

Where Demo 6 connects to a single MCP server directly, Demo 7 uses a Toolbox: a managed Microsoft Foundry resource that bundles multiple tools into a single, versioned, MCP-compatible endpoint.
The Toolbox in this demo exposes both GitHub Issues and GitHub Repos tools, curated into an immutable versioned snapshot.

This pattern is significant for production multi-agent systems:
- Centralised governance: one team owns the tool definitions; all agents consume them via a single endpoint.
- Versioned snapshots: promoting a new Toolbox version is explicit; agents pin to a version and upgrade intentionally.
- MCP compatibility: any MCP-capable agent or framework can connect, not just Foundry SDK agents.

    from azure.ai.projects.models import McpTool

    toolbox_tool = McpTool(
        server_label="toolbox",
        server_url=TOOLBOX_ENDPOINT,
        allowed_tools=[],  # empty = all tools in the Toolbox version
        headers={"Authorization": f"Bearer {token}"},
    )

Demo 8: Self-Hosted Agent with the Responses Protocol

The final demo departs from the prompt-agent pattern. Instead of registering a declarative agent in Foundry, Demo 8 implements a custom agent server using the Responses protocol. The server exposes a streaming HTTP endpoint; Foundry's Agent Inspector can connect to it and route user turns to it just as it would to a hosted prompt agent.

This demo includes a Dockerfile and an agent.yaml, enabling deployment to Foundry's container hosting service. It uses gpt-4.1-mini directly rather than the model router, because the custom server owns the entire inference path.

When to consider this pattern:
- Your agent requires custom pre- or post-processing logic that cannot be expressed in a system prompt.
- You need to integrate with infrastructure that is not reachable through MCP or built-in tools.
- You want to own the inference call for cost control, A/B testing, or compliance reasons.
- You are building a multi-agent orchestrator that needs to expose itself as an agent to other orchestrators.

Getting Started

The lab requires Python 3.10 or higher, an Azure subscription with a Microsoft Foundry project, and the Azure CLI.

1. Clone and set up the virtual environment

    git clone https://github.com/microsoft-foundry/Foundry-Agent-Lab.git
    cd Foundry-Agent-Lab

    # Create and activate the virtual environment
    python -m venv .venv

    # Windows Command Prompt
    .venv\Scripts\activate.bat

    # Windows PowerShell
    .venv\Scripts\Activate.ps1

    # macOS / Linux
    source .venv/bin/activate

    pip install -r requirements.txt

2. Configure a demo

    copy hello-demo\.env.sample hello-demo\.env
    # Edit hello-demo\.env and set PROJECT_ENDPOINT

Your PROJECT_ENDPOINT is on the Overview page of your Foundry project in the Azure portal. It takes the form https://your-resource.ai.azure.com/api/projects/your-project.

3. Run the demo

    az login
    0-hello-demo

Each numbered batch file at the root activates the virtual environment, runs create_agent.py, and launches chat.py. Append log to capture the full session transcript:

    0-hello-demo log

Reset between runs

    hello-demo\reset.bat

Every demo includes a reset.bat that deletes the registered agent and any associated resources (vector stores, uploaded files). Demos are fully repeatable.

Architecture Principles Demonstrated

Across the nine demos, the lab illustrates a set of design principles that apply directly to production agent systems:

Keyless authentication throughout

Every demo uses DefaultAzureCredential. No API keys appear anywhere in the code. Locally, az login provides credentials. In production, managed identity takes over automatically: same code, no secrets to rotate.

Server-side conversation state

The Responses API stores conversation history server-side. Your application passes a conversation ID; Foundry maintains the thread. This eliminates the common bug of truncating history due to local list management and makes multi-process or multi-instance deployments straightforward.

Client-side vs server-side tool execution

The lab makes the distinction explicit. Function tools execute in your process: you control the code, the external call, and the error handling.
Built-in tools (WebSearch, CodeInterpreter, FileSearch) execute inside Foundry: you get results without managing execution infrastructure. MCP tools (Demo 6, 7) fall between these: they execute in a separately deployed server, with the protocol mediating the call.

Progressive tool introduction

Each demo's create_agent.py registers the agent once. The chat.py file handles the conversation loop. These two responsibilities are always separate, making it easy to update agent definitions without modifying conversation logic, and vice versa.

Security Considerations

When building agents for production, keep the following in mind:
- Never commit .env files. The .gitignore excludes them, but verify this before pushing. Use Azure Key Vault or environment variable injection in CI/CD pipelines.
- Use managed identity in production. DefaultAzureCredential automatically picks up managed identity when deployed to Azure, eliminating the need for any stored credentials.
- Apply human-in-the-loop for side-effecting tools. Demo 6 demonstrates this pattern for MCP tool calls. Any agent that can modify external state (create issues, send emails, write files) should surface proposed actions for confirmation.
- Validate tool outputs before use. Treat data returned by external tools (weather APIs, search results, document retrieval) as untrusted input. Prompt injection through tool results is a real attack surface; grounding instructions in your system prompt reduce but do not eliminate this risk.
- Scope Toolbox permissions narrowly. When using a Toolbox (Demo 7), use allowed_tools to restrict which tools the agent can call, rather than granting access to all tools in a Toolbox version.

Key Takeaways
- Start with the minimum. A prompt agent with no tools requires fewer than 30 lines of code using the Foundry SDK. Add tools only when the use case demands them.
- Use model-router unless you have a specific reason not to. The empirical data in the lab shows the router selects appropriate models across all task types: factual, creative, tool-calling, RAG, and code generation.
- Understand the client/server tool boundary. Function tools give you control; built-in tools give you simplicity. MCP and Toolbox give you governance and interoperability. Choose based on where you need control and where you need scale.
- Conversation state belongs on the server. Do not maintain conversation history in application memory if you can avoid it. The Responses API conversation object is designed for this.
- The hosted-demo pattern is for when you need to own the inference path. For most use cases, a declarative prompt agent is sufficient and far simpler to operate.

Next Steps
- Explore the repo: github.com/microsoft-foundry/Foundry-Agent-Lab
- Microsoft Foundry SDK documentation: learn.microsoft.com/azure/ai-studio/
- Responses API quickstart: Prompt agent quickstart
- Model Router conceptual documentation: Model Router for Microsoft Foundry
- Model Context Protocol: modelcontextprotocol.io
- Azure Identity SDK (DefaultAzureCredential): azure-identity Python SDK

The Foundry Agent Lab is open source under the MIT licence. Contributions, bug reports, and feature requests are welcome through GitHub Issues. See CONTRIBUTING.md for guidelines.
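The client-side half of the function-tool pattern, combined with the advice to treat tool results as untrusted, can be sketched without any SDK. Everything below (`TOOLS`, `dispatch_tool_call`, the fake `get_weather`, and the 500-character cap) is a hypothetical illustration under those assumptions, not code from the lab.

```python
import json
from typing import Callable


def get_weather(city: str) -> str:
    """Fake tool for illustration: a real one would call a weather API."""
    return json.dumps({"city": city, "temp_c": 18})


# Registry of tools this process is willing to execute
TOOLS: dict[str, Callable[..., str]] = {"get_weather": get_weather}
MAX_OUTPUT_CHARS = 500  # arbitrary cap chosen for this sketch


def dispatch_tool_call(name: str, raw_arguments: str) -> str:
    """Run one requested tool call locally and return a JSON string for the model."""
    if name not in TOOLS:
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        args = json.loads(raw_arguments)
        result = TOOLS[name](**args)
    except (json.JSONDecodeError, TypeError) as exc:
        # Malformed JSON or arguments that do not match the tool's signature
        return json.dumps({"error": str(exc)})
    if len(result) > MAX_OUTPUT_CHARS:
        # Refuse oversized output instead of passing it to the model unchecked
        return json.dumps({"error": "tool output too large"})
    return result


print(dispatch_tool_call("get_weather", '{"city": "Seattle"}'))
print(dispatch_tool_call("delete_repo", "{}"))
```

Returning errors as data rather than raising keeps the conversation going: the model sees that the call failed and can ask the user for clarification or retry with corrected arguments.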
Lee_Stott
May 21, 2026, Microsoft Developer Community Blog
You Can Scale MCP Servers Behind a Load Balancer on App Service — Here's How
Most MCP servers in the wild are single-instance processes. That's fine when they're driving a local Claude or VS Code session — but it's the wrong shape for a production agent fleet that has to absorb traffic spikes, ride through deploys, and survive instance failures. The good news: the MCP spec already grew up. The 2025-06-18 revision formalizes stateless HTTP transport (and the current 2025-11-25 revision keeps it), which means a single request carries everything the server needs to answer. No long-lived connection, no in-process session table, no sticky-session hacks to keep a client glued to one box. That tiny protocol change unlocks something big: you can stick an MCP server behind App Service's built-in load balancer and scale it like any other web API. This post walks through how, with a runnable sample. Sample: seligj95/app-service-mcp-stateless-scale-python. One azd up and you have a stateless FastAPI MCP server running on three App Service instances behind the platform load balancer, with a staging slot, Application Insights, and a k6 script that visualizes load distribution from the client side. Why "stateless" is the whole story Earlier MCP transports leaned on persistent connections — SSE channels and WebSocket-style sessions where the server held per-client state in memory (open tools, subscriptions, partial streams). That model is great for a local IDE talking to a local process. It's hostile to load balancing, because routing a follow-up request to a different instance breaks the session. The stateless HTTP transport flips that. Each request is a complete JSON-RPC envelope ( initialize , tools/list , tools/call ), every response is self-contained, and the server is allowed to forget the client between requests. Any instance can serve any call. That is the property a load balancer needs. 
In the sample, every tool is a pure function of its arguments: whoami reports the serving instance, lookup_fact reads a static dictionary, compute_primes runs a sieve. None of them touches per-client memory. That's not a constraint of the protocol; it's a discipline you adopt to keep statelessness intact.

Why App Service, and not Functions or AKS
Functions and AKS are two of the many good options for MCP server hosting, depending on what the MCP server is used for. The use case we are discussing here is a scaled MCP server, i.e. an MCP server that must reach a large and broad audience. Here are a few defaults that make App Service a solid option for this scenario:
Always On. Reasoning tools call into LLMs and external APIs; latencies routinely sit in the multi-second range. On the Consumption plan, Functions caps a single execution at ten minutes (five by default) and scales workers to zero between bursts, which discards warm caches. App Service keeps the process resident.
Horizontal scale is one parameter. Pick a Premium SKU, set the plan's capacity to N, and you have N instances behind a managed load balancer. No VMSS to declare, no ingress controller to wire up, no Service to reconcile.
Deployment slots. Swap a warmed-up staging slot into production for zero-downtime deploys. Critical when your "API" is an LLM tool surface that an agent is actively driving.
Easy Auth. OAuth 2.1 in front of the MCP endpoint without writing the flow yourself: turn on the App Service authentication blade and point it at Entra ID. The sample leaves this off so the deploy is one command, but the wiring is a checkbox away.
The TL;DR: it's PaaS that already knows how to run a stateless long-lived process at horizontal scale, which is exactly the shape of a scaled MCP server.

The FastAPI MCP server, end-to-end stateless
The whole transport is one POST handler.
The full source is in main.py , but here are the load-bearing pieces: @app.post("/mcp") async def mcp_endpoint(request: Request): body = await request.json() method = body.get("method", "") msg_id = body.get("id") if method == "initialize": return {"jsonrpc": "2.0", "id": msg_id, "result": _server_info()} if method == "tools/list": return {"jsonrpc": "2.0", "id": msg_id, "result": {"tools": [...]}} if method == "tools/call": params = body.get("params", {}) result = await MCP_TOOLS[params["name"]]["function"](**params.get("arguments", {})) return { "jsonrpc": "2.0", "id": msg_id, "result": {"content": [{"type": "text", "text": json.dumps(result)}]}, } There is no session table. There is no client_id cookie. There is no AsyncIterator held open between requests. initialize , tools/list , and tools/call all return in a single round trip, which is the shape App Service's load balancer expects. The most useful debugging tool in the sample is whoami : async def tool_whoami() -> Dict[str, Any]: return { "instance_id": os.environ.get("WEBSITE_INSTANCE_ID", "local"), "hostname": socket.gethostname(), ... } WEBSITE_INSTANCE_ID is unique per App Service worker. Call whoami a few times from your MCP client and the value rotates — that's the load balancer working. If it doesn't rotate, something is pinning your traffic (almost always the ARR Affinity cookie; we'll get there). 
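The rotation check described above can also be scripted rather than eyeballed. The following is a minimal sketch, not code from the sample: `build_whoami_request`, `extract_instance_id`, `tally_instances`, and `fake_response` are hypothetical helper names, the tool name `whoami` is assumed to match the registered tool, and the response shape is assumed to be the one returned by the handler above (the tool result JSON-encoded into the first `content` item's `text` field).

```python
import json
from collections import Counter
from typing import Any, Dict, Iterable


def build_whoami_request(msg_id: int) -> Dict[str, Any]:
    """Build a JSON-RPC tools/call envelope for the (assumed) whoami tool."""
    return {
        "jsonrpc": "2.0",
        "id": msg_id,
        "method": "tools/call",
        "params": {"name": "whoami", "arguments": {}},
    }


def extract_instance_id(response: Dict[str, Any]) -> str:
    """Decode the JSON text payload of a tools/call response and return instance_id."""
    text = response["result"]["content"][0]["text"]
    return json.loads(text)["instance_id"]


def tally_instances(responses: Iterable[Dict[str, Any]]) -> Counter:
    """Count how many responses each instance served."""
    return Counter(extract_instance_id(r) for r in responses)


def fake_response(msg_id: int, instance_id: str) -> Dict[str, Any]:
    """Stand-in for a real server reply, so the sketch runs without a deployment."""
    payload = {"instance_id": instance_id, "hostname": "example-host"}
    return {
        "jsonrpc": "2.0",
        "id": msg_id,
        "result": {"content": [{"type": "text", "text": json.dumps(payload)}]},
    }


if __name__ == "__main__":
    # Simulated replies; against a real deployment these would come from
    # POSTing build_whoami_request(i) to the /mcp endpoint.
    replies = [fake_response(i, f"instance-{i % 3}") for i in range(9)]
    counts = tally_instances(replies)
    print(dict(counts))
    if len(counts) == 1:
        print("Only one instance answered; check whether affinity is pinning traffic.")
```

Against a live deployment, each reply would come from sending the envelope to `/mcp` with any HTTP client that does not persist cookies between calls, since a stored affinity cookie would mask the rotation.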
The Bicep that actually makes it scale
The infra is a P0v3 plan with capacity: 3, a web app with affinity disabled, and a staging slot on the same plan:

resource appServicePlan 'Microsoft.Web/serverfarms@2024-04-01' = {
  name: name
  sku: {
    name: 'P0v3'
    capacity: instanceCount // 3 by default
  }
  properties: {
    reserved: true
  }
}

resource web 'Microsoft.Web/sites@2024-04-01' = {
  name: name
  properties: {
    serverFarmId: appServicePlanId
    httpsOnly: true
    clientAffinityEnabled: false // ← the one line that matters
    siteConfig: {
      linuxFxVersion: 'PYTHON|3.11'
      alwaysOn: true
      healthCheckPath: '/health'
      appCommandLine: 'python -m uvicorn main:app --host 0.0.0.0 --port 8000'
    }
  }
}

resource staging 'Microsoft.Web/sites/slots@2024-04-01' = {
  parent: web
  name: 'staging'
  properties: { /* same shape: separate hostname, same plan */ }
}

The single most important line in that template is clientAffinityEnabled: false. App Service defaults to on, which sets the ARRAffinity cookie and pins every subsequent request from a given client to the instance that handled the first one. That default exists because legacy ASP.NET apps used in-process session state. Stateless MCP does not. Leaving affinity on silently undoes everything we just built.
The sample uses Premium v3 (P0v3) because it covers both Always On and deployment slots. Always On needs at least the Basic tier and deployment slots need at least Standard, so the Free and Shared tiers give you neither.

Application Insights without writing telemetry code
The sample drops a short bootstrap into main.py:

import os
from azure.monitor.opentelemetry import configure_azure_monitor

# Only enable telemetry when a connection string is configured.
if os.environ.get("APPLICATIONINSIGHTS_CONNECTION_STRING"):
    configure_azure_monitor(logger_name="mcp")

The Azure Monitor OpenTelemetry distro auto-instruments FastAPI and outbound HTTP. Every request span App Service emits is tagged with cloud_RoleInstance, which Application Insights populates from WEBSITE_INSTANCE_ID. That makes the question "is traffic actually spreading across my instances?"
a one-liner in Logs: requests | where timestamp > ago(15m) | where name contains "/mcp" | summarize count() by cloud_RoleInstance | order by count_ desc If you see three roughly-equal rows, you're done. If you see one row, your client is sending ARRAffinity cookies — turn affinity off and redeploy. Deploy azd auth login azd up That provisions the resource group, plan, web app, staging slot, Log Analytics workspace, and Application Insights resource, then deploys the Python app via Oryx. The output prints both WEB_URI and WEB_STAGING_URI . Open the production URI — the home page renders the instance ID that served it. Refresh. The ID changes. To swap the staging slot into production with no downtime: az webapp deployment slot swap \ --resource-group <rg> --name <app> \ --slot staging --target-slot production App Service warms the staging instances, redirects traffic, and the old production becomes the new staging — the classic blue-green pattern, but free. Prove it scales The sample ships a k6 script that hammers /mcp with tools/call requests and tags every response with the instance_id the server returned: BASE_URL=https://<your-app>.azurewebsites.net \ k6 run --summary-export=summary.json loadtest/k6-mcp.js jq '.metrics.mcp_instance_hits.values' summary.json The output groups hits per instance tag. On a three-instance plan with a 60-second steady load you should see something close to: { "count": 1842, "instance0d3e2f...": 614, "instance7a91bc...": 612, "instance19f0c4...": 616 } Roughly 33% on each box — the App Service load balancer round-robining new connections, with no help from the application. What I'd do next The sample is intentionally a starting point. Two extensions are the obvious next moves: Add Easy Auth. Turn on App Service authentication, pick Entra ID, require auth on /mcp . The token surfaces as headers; your tool handlers can use it to identify the calling agent without you owning any of the OAuth machinery. Autoscale on CPU. 
instanceCount: 3 is a starting point. Wire up Microsoft.Insights/autoscalesettings against the plan and let it scale 3 → 10 on the prime-counting tool. The architecture already supports it — that's the whole point of stateless. Try it Sample repo: github.com/seligj95/app-service-mcp-stateless-scale-python MCP spec: modelcontextprotocol.io/specification/2025-11-25 App Service docs: learn.microsoft.com/azure/app-service/overview If you ship something with it, I'd love to hear how it held up.
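The per-instance hit counts from the k6 summary shown earlier can be checked mechanically instead of by eye. This is a minimal sketch under stated assumptions: the input has the shape of that example output (a `count` key plus one key per instance), `instance_shares` and `is_balanced` are hypothetical helper names, the instance keys below are placeholders, and the 5-point tolerance is an arbitrary choice rather than a recommendation from the sample.

```python
from typing import Dict


def instance_shares(values: Dict[str, float]) -> Dict[str, float]:
    """Return each instance's fraction of total hits, ignoring the 'count' key."""
    hits = {key: value for key, value in values.items() if key != "count"}
    total = sum(hits.values())
    if total == 0:
        return {}
    return {key: value / total for key, value in hits.items()}


def is_balanced(
    values: Dict[str, float],
    expected_instances: int,
    tolerance: float = 0.05,
) -> bool:
    """True when every expected instance took roughly an equal share of traffic."""
    shares = instance_shares(values)
    # Fewer responding instances than expected means traffic is pinned somewhere.
    if len(shares) < expected_instances:
        return False
    fair_share = 1 / len(shares)
    return all(abs(share - fair_share) <= tolerance for share in shares.values())


if __name__ == "__main__":
    # Counts taken from the example summary above; the keys are placeholders.
    summary = {"count": 1842, "instance-a": 614, "instance-b": 612, "instance-c": 616}
    for name, share in instance_shares(summary).items():
        print(f"{name}: {share:.1%}")
    print("balanced:", is_balanced(summary, expected_instances=3))
```

Passing the expected instance count explicitly matters: a run where a single instance served everything is perfectly "even" on its own, which is exactly the affinity failure the check is meant to catch.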
jordanselig
May 20, 2026, Apps on Azure Blog
Debugging Python apps on App Service with the new SSH helper aliases
You shipped a Python app to App Service. It worked in the demo. It works locally. In production, /chat is returning 502s — but /health is green, the deployment succeeded, the logs are quiet, and your laptop can't reproduce it. What you actually need is a shell on the running container so you can poke at DNS, env vars, installed packages, the listening port, and the AI endpoint your app is calling. The platform has had SSH for a while, but the playbook of "open SSH, then remember which 14 commands to run" was tribal knowledge. We just shipped a set of SSH helper aliases that turn that tribal knowledge into one-word commands. apphelp shows you everything; appconfig , showpkgs , and appcurl cover the app side; ai-test , ai-diagnose , ai-curl , ai-latency , ai-dns , and ai-access-check cover the Azure AI Foundry side. This post is a hands-on tour. We built a deliberately fragile FastAPI sample with six different fault modes, deployed it, broke it, and SSH'd in to watch the aliases drive each one to root cause. Every transcript below is real output from the deployed sample. 📦 Sample repo: seligj95/app-service-ssh-diagnostics-python — azd up and you have a fault-injectable Python + Foundry app live in your subscription in about 4 minutes. The sample, in one breath FastAPI app, Python 3.14, App Service Linux on P0v3 — uses the new Oryx FastAPI auto-detection so no custom startup command is needed Calls Azure OpenAI (gpt-4o-mini) via managed identity — no keys POST /admin/fault toggles one of seven modes: off , bad-creds , wrong-endpoint , dns-fail , port-mismatch , dep-import-error , latency-spike GET / is a landing page with a built-in cheat sheet of the SSH aliases The endpoints are intentionally boring. The point is to give the aliases something realistic to chew on. A quick note on Azure OpenAI vs. AI Foundry. This sample provisions an Azure OpenAI account ( kind: OpenAI ). 
The new ai-* aliases speak the OpenAI chat-completions API ( /openai/deployments/<model>/chat/completions ), which is identical on Azure OpenAI and on Azure AI Foundry projects — both expose *.openai.azure.com endpoints, both accept managed-identity bearer tokens, both speak the same schema. The aliases work against either; the env-var name AZURE_AI_FOUNDRY_ENDPOINT is just the alias contract. Drop a Foundry endpoint into it and the same walkthrough applies. Shout-out to the new FastAPI auto-detect on Python 3.14. This sample also benefits from another recent App Service change: on Python 3.14+, App Service automatically detects FastAPI apps and starts them with gunicorn -k uvicorn_worker.UvicornWorker — no custom startup command needed. Our Bicep ships an empty appCommandLine and lets Oryx do the right thing. The whole sample is a nice tour of recent App Service Python improvements landing together. Step zero: apphelp After azd up finishes, the first thing to do over SSH is: az webapp ssh -g rg-ssh-diag-demo -n app-web-<token> Then inside the container: $ apphelp apphelp prints every alias the image ships with, grouped by category. You don't need to memorize anything — when you forget what checkport does, you run apphelp and it's right there. We'll lean on most of these: App info: showpkgs , appconfig , appenv Logs: applogs , deploylogs , logfiles Reachability: appcurl , checkport , gohome , gosrc AI/Foundry: ai-test , ai-dns , ai-access-check , ai-curl , ai-latency , ai-diagnose Network tools: install-nettools The healthy baseline Before breaking anything, run ai-diagnose . This is the one-shot "is my AI path healthy?" check, and it's the alias we reach for most: $ ai-diagnose ──────────────────────────────────────────────────────────────── AI Foundry Diagnostics ──────────────────────────────────────────────────────────────── [✓] Managed identity token [✓] DNS resolution (d8f9grasb7ewc7h8.ai-gateway.eastus2-01.azure-api.net. 
- public) [✓] Foundry connectivity (761ms) ──────────────────────────────────────────────────────────────── Three green checks tell you three different things: the managed identity is issuing tokens, the Foundry hostname resolves, and the endpoint responded in a reasonable time. If any of these are red, you already know which layer the fault is in. For more detail, the individual aliases are worth knowing: $ ai-test ✓ Connected | 1009ms | Model: gpt-4o-mini | Auth: Managed Identity $ ai-access-check ✓ Foundry endpoint: https://cog-ftirxupt2yjoe.openai.azure.com/ ✓ Model: gpt-4o-mini ✓ Using auth mode: Managed Identity ✓ Access check passed: authorized to call Foundry $ ai-latency Running 5 requests to gpt-4o-mini... Request 1: 679ms ✓ Request 2: 826ms ✓ Request 3: 758ms ✓ Request 4: 641ms ✓ Request 5: 664ms ✓ Results (5/5 successful): Avg: 713ms | Min: 641ms | Max: 826ms And the app side: $ checkport ✓ App is listening on port 8000 $ appcurl /health HTTP Status: 200 Time: 0.002417s Size: 5423 bytes That's our "everything is fine" reference. Now let's break things. One trick: applying a fault inside the SSH shell A subtle thing trips people up the first time. POST /admin/fault mutates the app process's environment — but your SSH shell is a separate process. It inherited the container's env when you opened the session, so ai-test will still see the healthy values. 
The sample handles this by also writing a small file to the persistent share:

# app/faults.py
import shlex
from pathlib import Path

def _write_env_file() -> None:
    """Write fault env to /home/site/diagnostics/fault.env so SSH can `source` it."""
    diag = Path("/home/site/diagnostics")
    diag.mkdir(parents=True, exist_ok=True)
    snap = _snapshot_unlocked()  # helper defined elsewhere in faults.py (not shown here)
    lines = [f"# Active fault: {snap['mode']}", ""]
    for k, v in snap["env"].items():
        # shlex.quote("") already yields '', so empty values need no special case
        lines.append(f"export {k}={shlex.quote(v or '')}")
    (diag / "fault.env").write_text("\n".join(lines) + "\n")

After toggling a fault, run this once in your SSH session:

source /home/site/diagnostics/fault.env

Now the aliases see the same env the broken app sees. This pattern (flip a flag from outside, source the change inside) is worth stealing for your own debugging workflows.

Group A: faults the AI aliases catch directly
Some faults are in the path between App Service and Foundry: wrong endpoint, broken DNS, network. The ai-* aliases reproduce the failure end-to-end, and they tell you exactly which layer.

Fault 1: wrong-endpoint (a typo in the AOAI endpoint)
The most common AI-side incident: someone fat-fingers an app setting. The hostname still looks plausible (it's still *.openai.azure.com) but it's not your resource, and in this sample it does not resolve at all.
curl -X POST $URL/admin/fault -H 'content-type: application/json' \ -d '{"mode":"wrong-endpoint"}' curl $URL/chat -H 'content-type: application/json' \ -d '{"prompt":"hi"}' # HTTP 502 # {"detail":"APIConnectionError: Connection error."} SSH in, source the fault env, run the AI aliases: $ source /home/site/diagnostics/fault.env $ ai-dns Resolving: this-resource-does-not-exist.openai.azure.com ✗ DNS resolution failed for this-resource-does-not-exist.openai.azure.com $ ai-curl Request: POST https://this-resource-does-not-exist.openai.azure.com//openai/deployments/gpt-4o-mini/chat/completions?api-version=2024-02-01 Authorization: Bearer [hidden] Content-Type: application/json curl: (6) Could not resolve host: this-resource-does-not-exist.openai.azure.com $ ai-diagnose [✓] Managed identity token [✗] DNS resolution failed for this-resource-does-not-exist.openai.azure.com [✗] Foundry connectivity (HTTP 000) ai-diagnose collapses the whole story into three lines: token works, DNS fails, connectivity fails. The fault is unambiguously a bad endpoint — check appconfig and your Bicep parameters. Fault 2: dns-fail — NXDOMAIN A subtler variant of the same failure mode is when the endpoint is structurally wrong (private endpoint misconfigured, hosts file mishap, custom domain expired). ai-dns calls it out the same way: $ ai-dns Resolving: no-such-host.invalid.example ✗ DNS resolution failed for no-such-host.invalid.example If you need deeper diagnostics — say, you suspect a flaky resolver rather than the hostname itself — install-nettools gives you dig , nslookup , and friends without rebuilding the container. $ install-nettools $ dig openai.azure.com $ nslookup cog-ftirxupt2yjoe.openai.azure.com Group B: faults that pass ai-test but break your app Here's the most useful thing we learned building this sample: ai-test can be green while your app is on fire, and that's a signal, not a bug. The ai-* aliases call Foundry directly. 
If they're green and your app is red, the platform-to-Foundry path is fine — the divergence is in your app. Time to pivot to appenv , applogs , showpkgs . Fault 3: bad-creds — wrong AZURE_CLIENT_ID This one is the classic user-assigned managed identity mishap: you scoped your code to a user-assigned managed identity, but the GUID in AZURE_CLIENT_ID doesn't actually exist (or wasn't granted RBAC). curl -X POST $URL/admin/fault -d '{"mode":"bad-creds"}' curl $URL/chat -d '{"prompt":"hi"}' # HTTP 502 # {"detail":"ClientAuthenticationError: DefaultAzureCredential failed to retrieve a token..."} Now SSH in and try the AI aliases: $ source /home/site/diagnostics/fault.env $ ai-test ✓ Connected | 734ms | Model: gpt-4o-mini | Auth: Managed Identity $ ai-access-check ✓ Foundry endpoint: https://cog-ftirxupt2yjoe.openai.azure.com/ ✓ Using auth mode: Managed Identity ✓ Access check passed: authorized to call Foundry Both green. That looks like a contradiction, but it's not. The aliases authenticate using the system-assigned managed identity directly (via IMDS), and they pass. Your Python app uses DefaultAzureCredential , which honors AZURE_CLIENT_ID to pick a user-assigned identity — and that one is broken. The takeaway: when ai-test is green but /chat is red, the platform's identity is fine. Pivot to appenv to see exactly what env your app process sees, and check AZURE_CLIENT_ID : $ appenv | grep AZURE_CLIENT_ID AZURE_CLIENT_ID=00000000-0000-0000-0000-000000000000 There's the bug. The aliases didn't fail — they told you the fault isn't in the platform. That's diagnosis by elimination, and it's faster than guessing. Fault 4: dep-import-error — your code throws Same pattern. 
The app raises an ImportError on /chat , the AI aliases are green: curl -X POST $URL/admin/fault -d '{"mode":"dep-import-error"}' curl $URL/chat -d '{"prompt":"hi"}' # HTTP 500 # {"detail":"ImportError: No module named 'tiktoken'..."} This is where the app-side aliases earn their keep: $ showpkgs | head -20 ────────────────────────────────────────────────────── Virtual environment packages (antenv) ────────────────────────────────────────────────────── Package Version -------------------------------------- --------- annotated-types 0.7.0 anyio 4.13.0 azure-core 1.41.0 azure-identity 1.19.0 azure-monitor-opentelemetry 1.8.8 ... No tiktoken in that list. Confirmation in one command — no need to remember pip list or where the virtualenv lives. deploylogs then tells you what the last deployment actually built: $ deploylogs 10 Latest deployment: b8a64ed4-b6b7-4419-91eb-6d8e4e7ef323 Log file: /home/site/deployments/b8a64ed4-b6b7-4419-91eb-6d8e4e7ef323/log.log 2026-05-18T19:10:52.3844297Z,Parsing the build logs,abc3cf97-... 2026-05-18T19:10:52.5414396Z,Found 0 issue(s),7d11d013-... 2026-05-18T19:10:52.7913394Z,Build Summary :,... 2026-05-18T19:10:53.5643089Z,Deployment successful. deployer = Push-Deployer ... Build was clean. The package just isn't in requirements.txt . Two aliases, one minute, root cause. Fault 5: port-mismatch — uvicorn binds the wrong port A real-world bug: someone sets WEBSITES_PORT=9999 in app settings to expose a different port, but the app still binds to 8000. curl -X POST $URL/admin/fault -d '{"mode":"port-mismatch"}' The aliases tell you exactly which port everything sees: $ checkport Checking if app is listening on port 8000... ✓ App is listening on port 8000 $ appcurl /health Testing app at localhost:8000 ... HTTP Status: 200 Time: 0.002417s $ appconfig PORT Value: 8000 Note: The port your Python app should listen on. Default is 8000. The app is healthy from inside the container. 
The mismatch is between what the platform tries to forward to and what uvicorn is bound to. This is the kind of fault where curling the public URL fails but appcurl /health succeeds — and the contrast is itself the diagnosis. Fault 6: latency-spike — the alias bench is fast, your app is slow The app injects 4 seconds of asyncio.sleep before each Foundry call. /chat is now ~4.5 seconds. ai-latency : $ ai-latency Running 5 requests to gpt-4o-mini... Request 1: 715ms ✓ Request 2: 588ms ✓ Request 3: 578ms ✓ Request 4: 669ms ✓ Request 5: 643ms ✓ Results (5/5 successful): Avg: 638ms | Min: 578ms | Max: 715ms Foundry, from this instance, averages 638ms. If your app is taking 5 seconds end-to-end and ai-latency says the model is sub-second, the slowness is in your code — not in Foundry, not in the network. Time to look at App Insights end-to-end transactions, or at any pre-call work (retrieval, vector lookup, your own sleep). What this changes about the debugging workflow Before these aliases, the SSH playbook for a Python AI app went something like: open SSH, dig around /home/site/wwwroot/antenv , grep applicationHost.config for ports, write a curl by hand against the AOAI endpoint with a manually-fetched managed identity token, hope you got the API version right. Now it's ai-diagnose . If that's red, you know exactly which layer. If it's green, you know the fault is in your code or your settings, and appenv , appconfig , showpkgs , applogs walk you the rest of the way. Three patterns we'd lean on going forward: Start with apphelp and ai-diagnose every time. Don't try to remember the right command — let the aliases tell you. Treat ai-test being green as a signal, not a finish line. If /chat is red and ai-test is green, the platform path is fine; pivot to app-side aliases. Use source /home/site/diagnostics/fault.env as a pattern. Any time you want your SSH shell to see what the app process sees, write env to a file and source it. 
It's a small thing that removes a huge class of "but it worked when I tested it" confusions. We want feedback The aliases are GA today on Python images and we have ideas for where they go next — Node, .NET, more ai-* checks (Foundry agents, vector indexes), tighter integration with azd diagnose . If you have a Python app on App Service and you want a specific alias added, tell us by dropping a comment on this post. Try the sample git clone https://github.com/seligj95/app-service-ssh-diagnostics-python cd app-service-ssh-diagnostics-python azd auth login azd up Four minutes later you'll have the whole thing live. Then curl -X POST $URL/admin/fault -d '{"mode":"<pick one>"}' , SSH in, and walk through any of the six faults above. The README has the full alias-to-fault map.
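The fault.env pattern can be consumed from Python as well as sourced from a shell, which is handy for scripts that want to see what the app process sees. The following is a minimal sketch, not code from the sample: `render_env_file` mirrors the `export KEY=value` layout the sample writes, and `parse_env_file` is a hypothetical reader that only handles that simple layout (comments, blank lines, one `export KEY=VALUE` per line, no values containing newlines).

```python
import shlex
from typing import Dict


def render_env_file(mode: str, env: Dict[str, str]) -> str:
    """Render env vars as shell `export` lines, quoting values for safe sourcing."""
    lines = [f"# Active fault: {mode}", ""]
    for key, value in env.items():
        lines.append(f"export {key}={shlex.quote(value or '')}")
    return "\n".join(lines) + "\n"


def parse_env_file(text: str) -> Dict[str, str]:
    """Parse `export KEY=VALUE` lines back into a dict, skipping anything else."""
    env: Dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        # shlex.split undoes the quoting applied by shlex.quote.
        tokens = shlex.split(line)
        if len(tokens) != 2 or tokens[0] != "export" or "=" not in tokens[1]:
            continue
        key, _, value = tokens[1].partition("=")
        env[key] = value
    return env


if __name__ == "__main__":
    # Placeholder values for illustration only.
    fault_env = {
        "AZURE_AI_FOUNDRY_ENDPOINT": "https://example.invalid/",
        "AZURE_CLIENT_ID": "",
    }
    rendered = render_env_file("wrong-endpoint", fault_env)
    print(rendered)
    print(parse_env_file(rendered))
```

Quoting on the way out and using `shlex.split` on the way in keeps values with spaces or quotes intact in both the shell and the Python reader.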
jordanselig
May 19, 2026, Apps on Azure Blog
Confidence-Aware RAG: Teaching Your AI Pipeline to Acknowledge Uncertainty
Introduction
Retrieval-Augmented Generation (RAG) has become the standard architecture for grounding Large Language Models (LLMs) with enterprise data. By retrieving relevant documents before generating a response, RAG helps reduce hallucinations compared to relying on model knowledge alone. However, an important limitation remains in most implementations: RAG systems can produce confident-sounding answers even when the underlying data is incomplete, irrelevant, or missing. This happens when:
• Retrieved documents are loosely related to the query
• The answer exists partially but lacks key details
• Retrieved sources contradict each other
• The query falls entirely outside the knowledge base
In enterprise environments, this behavior carries real risk. A reliable AI system must not only answer well - it must also know when not to answer. This article presents a practical confidence-aware RAG architecture using three layered strategies: retrieval confidence scoring, citation validation, and LLM-based abstention - all implemented with Azure AI Search and Azure OpenAI.

The Problem: Confident Hallucination
Consider a real-world enterprise scenario. An employee asks: "What is our company's parental leave policy for contractors?" The knowledge base contains parental leave policies for full-time employees - but nothing specific to contractors. A standard RAG pipeline retrieves the closest matching document and confidently presents the full-time employee policy as the answer. This outcome is worse than returning no answer. The user trusts the system, acts on incorrect information, and the error may not surface until real consequences follow. This pattern is sometimes called hallucination laundering - the RAG architecture creates the appearance of factual grounding while the response is not actually supported by the retrieved evidence.
Fixing this requires deliberate confidence checkpoints at each stage of the pipeline.

Architecture Overview
A standard RAG pipeline follows a simple path:
User Query → Retrieve Documents → Generate Answer
A confidence-aware pipeline adds three explicit decision checkpoints, one for each strategy in this article:
User Query → Retrieve Documents → Retrieval Confidence Check → Generate Answer with Citations → Citation Validation → LLM Judge → Answer or Abstain
Each layer catches failures the previous one may miss. Together, they form a defense-in-depth approach to output reliability.

Strategy 1: Retrieval Confidence Scoring
The first checkpoint evaluates whether retrieved documents are genuinely relevant before passing them to the LLM. Azure AI Search returns a @search.rerankerScore when semantic ranking is enabled - a value on the 0-4 scale that reflects how well each document matches the query intent, not just keyword overlap. In the Python SDK the same value appears on each result under the key @search.reranker_score.

from azure.search.documents import SearchClient
from azure.identity import DefaultAzureCredential

# AZURE_SEARCH_ENDPOINT is assumed to be defined elsewhere (for example, loaded from configuration).
search_client = SearchClient(
    endpoint=AZURE_SEARCH_ENDPOINT,
    index_name="enterprise-knowledge-base",
    credential=DefaultAzureCredential()
)

def retrieve_with_confidence(query: str, threshold: float = 1.5, top_k: int = 5):
    results = search_client.search(
        search_text=query,
        query_type="semantic",
        semantic_configuration_name="default",
        top=top_k,
        select=["content", "title", "source"]
    )
    confident_results = []
    for result in results:
        # The Python SDK uses "@search.reranker_score"; "@search.rerankerScore" is the
        # REST field name. Check both, and treat a missing or None score as 0 so the
        # comparison below cannot raise.
        reranker_score = (
            result.get("@search.reranker_score")
            or result.get("@search.rerankerScore")
            or 0
        )
        if reranker_score >= threshold:
            confident_results.append({
                "content": result["content"],
                "title": result["title"],
                "source": result["source"],
                "score": reranker_score
            })
    return confident_results

If no documents clear the threshold, the pipeline abstains rather than forcing a low-quality answer:

results = retrieve_with_confidence(user_query, threshold=1.5)
if not results:
    return {
        "answer": (
            "I don't have enough information in the knowledge base to answer "
            "this question. Please contact the relevant team for assistance."
        ),
        "status": "abstained_retrieval"
    }

Threshold tuning: Start at 1.5 on the 0-4 scale.
Evaluate against a labeled test set and adjust based on your precision/recall requirements. Higher thresholds reduce false positives but may increase abstention on edge cases. Strategy 2: Citation Validation Even when retrieval scores are high, the LLM may synthesize information that does not exist in the retrieved context. Citation validation addresses this by requiring the model to ground every factual claim in a specific named source - and then programmatically verifying those citations exist in the retrieved set. from openai import AzureOpenAI client = AzureOpenAI( api_key=AZURE_OPENAI_API_KEY, azure_endpoint=AZURE_OPENAI_ENDPOINT, api_version="2025-12-01-preview" ) ANSWER_WITH_CITATIONS_PROMPT = """ You are an enterprise assistant. Answer the question using ONLY the provided context. RULES: 1. Every factual claim MUST include a citation in the format [Source: <title>]. 2. If the context does not contain enough information, respond with: "I don't have sufficient information to answer this question." 3. Do NOT infer, assume, or use knowledge outside the provided context. 4. If context partially answers the question, state what you know and explicitly note what information is missing. 
    Context: {context}

    Question: {question}

    Answer:
    """

    def generate_answer(question: str, context: str, sources: list) -> dict:
        prompt = ANSWER_WITH_CITATIONS_PROMPT.format(
            context=context,
            question=question
        )
        response = client.chat.completions.create(
            model=AZURE_DEPLOYMENT_NAME,
            messages=[{"role": "user", "content": prompt}],
            temperature=0
        )
        answer = response.choices[0].message.content.strip()
        validation = validate_citations(answer, sources)
        return {"answer": answer, "citation_check": validation}

The validation function checks that every citation in the answer maps to a document that was actually retrieved:

    import re

    def validate_citations(answer: str, sources: list) -> dict:
        # Pull every "[Source: <title>]" marker out of the generated answer.
        cited = re.findall(r'\[Source:\s*(.+?)\]', answer)
        source_titles = {s["title"].lower().strip() for s in sources}
        valid, invalid = [], []
        for citation in cited:
            if citation.lower().strip() in source_titles:
                valid.append(citation)
            else:
                invalid.append(citation)
        return {
            "total_citations": len(cited),
            "valid": valid,
            "invalid": invalid,
            # Trustworthy only if at least one citation exists and none are invented.
            "is_trustworthy": len(invalid) == 0 and len(cited) > 0
        }

If is_trustworthy is False, the pipeline flags the response for review or suppresses it:

    if not generation["citation_check"]["is_trustworthy"]:
        return {
            "answer": "I found related information but cannot provide a reliable answer based on the available sources.",
            "status": "abstained_citation"
        }

Strategy 3: LLM-Based Abstention Scoring

The third layer adds a second LLM call that acts as a quality judge - explicitly evaluating whether the generated answer is well-supported by the retrieved context, independent of citation formatting.

    import json

    ABSTENTION_JUDGE_PROMPT = """
    You are an answer quality judge. Given a question, retrieved context, and a generated answer, evaluate whether the answer is fully supported by the context.

    Respond ONLY in JSON format:
    {{
      "verdict": "supported" | "partial" | "unsupported",
      "confidence": <float between 0.0 and 1.0>,
      "reasoning": "<brief explanation>"
    }}

    Question: {question}
    Context: {context}
    Answer: {answer}
    """

    def judge_answer(question: str, context: str, answer: str) -> dict:
        prompt = ABSTENTION_JUDGE_PROMPT.format(
            question=question,
            context=context,
            answer=answer
        )
        response = client.chat.completions.create(
            model=AZURE_DEPLOYMENT_NAME,
            messages=[{"role": "user", "content": prompt}],
            temperature=0
        )
        return json.loads(response.choices[0].message.content.strip())

Integrate the judge with a confidence threshold of 0.6:

    judgement = judge_answer(user_query, context, generation["answer"])
    if judgement["verdict"] == "unsupported" or judgement["confidence"] < 0.6:
        return {
            "answer": "I don't have sufficient information to answer this question confidently.",
            "status": "abstained_judge"
        }
    if judgement["verdict"] == "partial":
        generation["answer"] += (
            "\n\nNote: This answer may be incomplete. "
            "Some aspects of your question were not covered in the available documents."
        )

End-to-End Pipeline

Combining all three strategies gives a complete confidence-aware pipeline:

    def confidence_aware_rag(user_query: str) -> dict:
        # Layer 1: Retrieve with confidence gating
        results = retrieve_with_confidence(user_query, threshold=1.5)
        if not results:
            return {
                "answer": "I don't have enough information in the knowledge base to answer this.",
                "status": "abstained_retrieval"
            }
        context = "\n\n".join(r["content"] for r in results)

        # Layer 2: Generate with citation requirements
        generation = generate_answer(user_query, context, results)
        if not generation["citation_check"]["is_trustworthy"]:
            return {
                "answer": "I found related information but cannot provide a reliable answer.",
                "status": "abstained_citation"
            }

        # Layer 3: Judge the answer
        judgement = judge_answer(user_query, context, generation["answer"])
        if judgement["verdict"] == "unsupported" or judgement["confidence"] < 0.6:
            return {
                "answer": "I don't have sufficient information to answer this question confidently.",
                "status": "abstained_judge"
            }
        if judgement["verdict"] == "partial":
            generation["answer"] += (
                "\n\nNote: This answer may be incomplete. "
                "Some aspects of your question were not covered in available documents."
            )

        return {
            "answer": generation["answer"],
            "status": "answered",
            "confidence": judgement["confidence"],
            "sources": [r["source"] for r in results[:3]]
        }

Choosing the Right Strategies for Your Use Case

Each strategy adds a layer of safety at a different cost. The right combination depends on the stakes involved in your deployment.
| Strategy | Added Cost | Latency | Best For |
|---|---|---|---|
| Retrieval Confidence Scoring | None (uses existing search scores) | None | All RAG applications - this should be universal |
| Citation Validation | Minimal (regex post-processing) | Negligible | Regulated industries, compliance, audit trails |
| LLM Abstention Judge | One additional LLM call | +1-3 seconds | High-stakes decisions - financial, legal, medical |

For most enterprise applications, combining retrieval scoring and citation validation provides a strong baseline with minimal overhead. The judge layer is most valuable when incorrect answers carry significant business or compliance risk.

Threshold calibration

There is a meaningful tradeoff in threshold selection. Setting thresholds too high reduces hallucination but increases abstention - the system may refuse to answer even when reliable information is available. The recommended approach is to build a labeled evaluation set of query/answer pairs, run the pipeline at multiple threshold values, and select the point that meets your precision/recall requirements for the specific domain.

When to Apply This Pattern

Confidence-aware RAG is most valuable in deployments where:

- Data coverage is uneven - the knowledge base may have detailed coverage in some areas and gaps in others, making it difficult to predict when retrieval will be reliable
- Errors carry downstream consequences - healthcare documentation, legal and compliance search, financial reporting, and regulated industries where a wrong answer is worse than no answer
- Users have varying expertise - non-expert users may not recognize a plausible-sounding but incorrect response, making transparent uncertainty signals especially important
- Audit or traceability requirements apply - the ability to trace each answer back to a specific source with a confidence signal supports governance and review workflows

Conclusion

Building a RAG system that retrieves documents and generates responses is relatively straightforward.
Building one that understands the limits of its own knowledge requires deliberate design. The three strategies covered here - retrieval confidence scoring, citation validation, and LLM-based abstention - form a layered defense against the most common failure mode in production RAG systems: the confident, well-formatted, completely unreliable answer. The most dangerous AI system is not one that fails openly. It is one that fails silently, with confidence. Teaching your pipeline to say "I don't know" is not a limitation. It is a feature that builds user trust and makes enterprise AI adoption sustainable over time.
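The threshold calibration procedure described earlier can be sketched as a simple sweep over a labeled evaluation set. This is a minimal illustration under stated assumptions, not the author's code: the `EvalItem` record, the `sweep_thresholds` and `pick_threshold` helpers, and the toy scores are all hypothetical, and each labeled item is assumed to carry the top retrieval score for its query plus a human judgment of whether the generated answer would be correct.

```python
from dataclasses import dataclass


@dataclass
class EvalItem:
    top_score: float       # best retrieval score for the query (hypothetical field)
    answer_correct: bool   # human label: would the generated answer be correct?


def sweep_thresholds(items, thresholds):
    """For each threshold, report precision on answered queries and the abstention rate."""
    rows = []
    for t in thresholds:
        answered = [i for i in items if i.top_score >= t]
        correct = sum(i.answer_correct for i in answered)
        precision = correct / len(answered) if answered else None
        abstention = 1 - len(answered) / len(items)
        rows.append({"threshold": t, "precision": precision, "abstention_rate": abstention})
    return rows


def pick_threshold(rows, min_precision):
    """Return the lowest threshold that meets the precision target, or None."""
    for row in sorted(rows, key=lambda r: r["threshold"]):
        if row["precision"] is not None and row["precision"] >= min_precision:
            return row["threshold"]
    return None


# Toy labeled set (made-up numbers, for illustration only).
items = [
    EvalItem(2.4, True), EvalItem(2.1, True), EvalItem(1.8, True),
    EvalItem(1.6, False), EvalItem(1.2, True), EvalItem(0.9, False),
]
rows = sweep_thresholds(items, thresholds=[1.0, 1.5, 2.0])
best = pick_threshold(rows, min_precision=0.9)
```

On this toy data only the 2.0 threshold reaches 90% precision, and it does so by abstaining on two thirds of the queries, which is the tradeoff the calibration step is meant to surface.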
RohitPoddar
May 19, 2026 | Microsoft Developer Community Blog
Turn Your App Service Web App Into a Self-Healing Agent: LLMOps Best Practices for Production
A user submits a prompt. The agent burns through 50,000 tokens looping on a malformed tool response. Another user trips a model rate limit and the agent silently fails. A bad prompt update goes out at 4 PM Friday and degrades success rate to 60%. Your APM dashboard shows green the entire time because none of that is a 500.

This post walks through the LLMOps stack we built into a working reference sample on Azure App Service: the SLIs that matter for agents, a budget circuit breaker, prompt-repair retries, and a fully automated slot-swap rollback when things go sideways. Every code snippet is from the deployable sample at the end of the post.

📦 Sample repo: seligj95/app-service-self-healing-agent-python. Run azd up and you've got the whole stack live in your subscription in under 10 minutes.

Why agent ops ≠ web-app SRE

Your web app's reliability model assumes a request maps to bounded work: a SQL query, a cache hit, a templated response. You alert on Http5xx, p95 latency, and dependency failures. Done. An agent breaks that model in four ways:

- Cost is unbounded per request. An agent that loops on a flaky tool can spend $5 on one user prompt. The HTTP response is still 200.
- Failure can be silent. A model can hallucinate confident JSON, a tool can return malformed args, and the agent dutifully returns a wrong answer to the user. Zero exceptions logged.
- Latency is non-deterministic. A "simple" prompt that normally finishes in 2 seconds can blow out to 30s when the model picks an expensive plan. p95 latency tells you nothing.
- Quality regresses on prompt changes, not code changes. A prompt tweak that ships in seconds can crater tool-call accuracy by 30%. Your CI/CD pipeline didn't catch it because there were no failing tests.

Web-app SLOs (uptime, latency, error rate) are necessary but not sufficient. Agents need agent-shaped SLOs.

Define your agent SLOs first

Before instrumenting anything, write down what "healthy" means. Here are the four SLIs we chose for the sample.
None of them are Http5xx.

| SLI | What it measures | Why it matters |
|---|---|---|
| Task success rate | % of /chat requests that the agent self-classifies as completed | Catches silent failures the HTTP layer misses |
| Cost per task | $ spent (input + output tokens × model rate) per /chat | The unbounded-loop problem in one number |
| Tool success rate | % of tool invocations that didn't raise | Tool layer is where most agent failures live |
| Repair retries | Times we re-prompted the model after a schema-validation failure | Leading indicator of prompt drift |

In our reference middleware these come out as agent.task.success, agent.cost.usd, agent.tool.success, and agent.repair.retry, part of a set of eleven custom metrics in total. We emit them via OpenTelemetry so they land in App Insights customMetrics, and the included KQL workbook visualizes them as SLO tiles.

Observability stack on App Service

App Service makes the observability story unusually easy because you get App Insights wired up automatically by azd: no agent install, no DaemonSet, no sidecar. The only thing you bring is the SDK init for your custom metrics:

    # llmops_middleware/sli.py
    import os

    from azure.monitor.opentelemetry import configure_azure_monitor
    from opentelemetry import metrics

    def configure_azure_monitor_if_available() -> bool:
        # Skip exporter setup when no connection string is configured.
        if not os.getenv("APPLICATIONINSIGHTS_CONNECTION_STRING"):
            return False
        configure_azure_monitor()
        return True

    meter = metrics.get_meter("agent")
    tokens_in = meter.create_counter("agent.tokens.in")
    cost_usd = meter.create_counter("agent.cost.usd")
    task_latency = meter.create_histogram("agent.task.latency")
    tool_success = meter.create_counter("agent.tool.success")
    # ...
We compute cost from a per-model rate card so the metric is in real dollars, not abstract tokens:

    COST_PER_1K_TOKENS = {
        "gpt-4o": {"in": 0.0025, "out": 0.01},
        "gpt-4o-mini": {"in": 0.00015, "out": 0.0006},
    }

    def record_cost(model: str, tokens_in_count: int, tokens_out_count: int, tenant: str) -> float:
        rate = COST_PER_1K_TOKENS[model]
        # Rates are per 1K tokens, so divide the weighted token total by 1000.
        cost = (tokens_in_count * rate["in"] + tokens_out_count * rate["out"]) / 1000
        cost_usd.add(cost, {"model": model, "tenant": tenant})
        return cost

Once those flow, the KQL queries write themselves:

    // Top cost-burning tenants in the last hour
    customMetrics
    | where timestamp > ago(1h)
    | where name == "agent.cost.usd"
    | extend tenant = tostring(customDimensions["tenant"])
    | summarize spend_usd = sum(valueSum) by tenant
    | top 10 by spend_usd desc

The sample ships a 6-tile workbook (observability/workbook.json) deployed via Bicep. It renders SLO compliance, cost burn-down, tool failure breakdown, latency percentiles, budget breaches, and healing signals out of the box.

Figure: The deployed workbook in App Insights. The SLO panel dips during a chaos run and recovers as the agent self-heals, exactly the signal you want on a glass-pane dashboard.

Cost guardrails with a budget circuit breaker

Custom metrics tell you about cost after you spent it. To prevent runaways, you need a circuit breaker that bites before the model call happens.
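To make the stakes concrete, it helps to run the rate card on the runaway request from the opening scenario. The following is a minimal sketch, not code from the sample: `estimate_cost` is a hypothetical pure-function variant of `record_cost` with the metric emission removed, and the 2,000 output tokens and 100 repeats per day are assumed numbers chosen for illustration.

```python
COST_PER_1K_TOKENS = {
    "gpt-4o": {"in": 0.0025, "out": 0.01},
    "gpt-4o-mini": {"in": 0.00015, "out": 0.0006},
}


def estimate_cost(model: str, tokens_in: int, tokens_out: int) -> float:
    """Dollar cost of one call: same formula as record_cost, without telemetry."""
    rate = COST_PER_1K_TOKENS[model]
    return (tokens_in * rate["in"] + tokens_out * rate["out"]) / 1000


# One looping request: 50,000 input tokens and 2,000 output tokens (assumed).
primary = estimate_cost("gpt-4o", 50_000, 2_000)
downshifted = estimate_cost("gpt-4o-mini", 50_000, 2_000)
daily = primary * 100  # assumed: the same loop is triggered 100 times in a day

print(f"gpt-4o: ${primary:.4f}  gpt-4o-mini: ${downshifted:.4f}  100x/day: ${daily:.2f}")
```

Under these assumptions a single looping request costs about $0.145 on gpt-4o and under a cent on gpt-4o-mini. The metric only reports that spend afterwards, which is why the decision has to be made before the call.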
The middleware in llmops_middleware/budget.py keeps a per-tenant counter in memory (per month) and returns a decision:

    class BudgetDecision(Enum):
        ALLOW = "allow"          # under budget
        DOWNSHIFT = "downshift"  # ≥80%: switch to cheaper model
        BLOCK = "block"          # ≥100%: refuse the request

    def evaluate(tenant: str) -> BudgetDecision:
        spent = _spend.get((tenant, _current_period()), 0.0)
        if spent >= BUDGET_USD_PER_TENANT:
            return BudgetDecision.BLOCK
        if spent >= BUDGET_USD_PER_TENANT * 0.80:
            return BudgetDecision.DOWNSHIFT
        return BudgetDecision.ALLOW

The agent loop reads that decision and downshifts from gpt-4o to gpt-4o-mini, a roughly 16.7× cost reduction ($0.0025 vs $0.00015 per 1K input tokens), when a tenant crosses 80% of their monthly budget. The user keeps getting answers; the bill stops climbing.

    def _pick_model(tenant: str) -> str:
        decision = budget.evaluate(tenant)
        if decision == BudgetDecision.DOWNSHIFT:
            sli.model_downshift.add(1, {"tenant": tenant})
            return DOWNSHIFT_MODEL
        return PRIMARY_MODEL

For the demo we keep state in memory; production should swap the dict for Redis (atomic INCRBY) or Cosmos with optimistic concurrency. The interface in budget.py is intentionally tiny so this is a 10-line change.

Self-healing patterns

There are three patterns in the sample, each addressing a different failure class.

1. Retry with prompt-repair

The most common agent failure isn't a tool exception. It's the model returning malformed JSON that fails schema validation on tool args. The fix is to feed the validation error back into the model and ask it to repair the call:

    # llmops_middleware/repair.py
    async def retry_with_repair(call_fn, args, *, max_attempts=2):
        last_exc = None
        for attempt in range(max_attempts):
            try:
                return await call_fn(args)
            except (ValidationError, RepairableError) as exc:
                last_exc = exc
                sli.repair_retry.add(1, {"attempt": str(attempt)})
                args = await _ask_model_to_repair(args, str(exc))
        # Every attempt failed: surface the last validation error to the caller.
        raise last_exc

This single pattern recovers 50–70% of "the agent returned garbage" cases without escalating.

2. Tool fallback chains

When a primary tool times out or fails open, try a cheaper or simpler one:

    async def tool_fallback_chain(primary, *fallbacks, args):
        for fn in (primary, *fallbacks):
            try:
                return await fn(args)
            except ToolUnavailable:
                sli.tool_success.add(1, {"tool": fn.__name__, "status": "fallback"})
        raise NoToolAvailable()

Lookup-style tools especially benefit: web search → cached snapshot → static knowledge base.

3. Slot-swap auto-rollback

Here's the killer feature App Service brings that's a slog on K8s: deployment slots. You always have a known-good previous version warmed up and one ARM API call away from production traffic. We wire that up to fire automatically when our SLI breaches. The chain is:

1. Metric alert on Http5xx > 5 in 5 minutes (the platform metric, free)
2. Action Group that POSTs to a Logic App webhook (SAS-signed callback URL)
3. Logic App that calls POST /sites/{name}/slots/staging/slotsswap via its managed identity (granted Website Contributor on the target web app)

The whole healer is one trigger + two actions: receive the alert webhook, call ARM slotsswap, return a status payload to the caller. The swap action in Bicep:

    SwapSlots: {
      type: 'Http'
      inputs: {
        method: 'POST'
        uri: '${environment().resourceManager}@{parameters(\'targetSiteId\')}/slots/staging/slotsswap?api-version=2024-04-01'
        body: { targetSlot: 'production' }
        authentication: {
          type: 'ManagedServiceIdentity'
          audience: environment().resourceManager
        }
      }
    }

No code to deploy, no secrets to manage, no second runtime to babysit. From alert-fire to swapped-slot is about 4 minutes in our tests, which is under the SLA most agent products have for "user-visible degraded mode."

Why not a Function App? We started there. The Logic App is 60 lines of Bicep and zero application code. For a one-action workflow like "swap a slot," the Function adds packaging, deployment, and a runtime to monitor for no benefit.

Chaos testing for agents

You can't trust a self-healing system you haven't broken.
The sample ships a chaos CLI and an in-process injection point so you can practice failures on demand.

In-process: llmops_middleware/chaos.py exposes four modes (off, throttle, malformed, outage) togglable via POST /admin/chaos. When set, tool calls roll a die and raise the matching exception with the configured probability:

    class ChaosController:
        def maybe_inject(self) -> None:
            # random.random() is in [0.0, 1.0), so >= keeps probability 0.0
            # from ever injecting and 1.0 always injecting.
            if random.random() >= self.probability:
                return
            if self.mode == "outage":
                raise ChaosOutage("simulated tool outage")
            if self.mode == "throttle":
                raise ChaosThrottled("simulated 429")
            if self.mode == "malformed":
                raise ChaosMalformed("simulated bad tool output")

External: chaos/inject.py is a small async load driver that sets /admin/chaos then drives /chat at a target RPS, tallying response codes:

    python chaos/inject.py \
      --base-url https://my-agent.azurewebsites.net \
      --mode outage --probability 1.0 --rps 10 --duration 300

Running that for 5 minutes against the deployed sample reliably:

- Drives customMetrics(name="agent.task.failure") over 50/min
- Trips the Http5xx > 5 metric alert (~90 seconds after threshold breach)
- Fires the Logic App run (succeeded in 1.2 seconds in our test)
- Flips the slot: the /health instance ID changes

The repo's observability/queries.kql has the canonical KQL for each of these signals, and observability/workbook.json is the deployable workbook that visualizes them.

The reference middleware

Everything in this post is in seligj95/app-service-self-healing-agent-python. The Python package llmops_middleware/ is the part you'd vendor into your own agent: sli.py, budget.py, repair.py, chaos.py. The agent loop and the Bicep are demo-quality but production-shaped.

Run it yourself:

    git clone https://github.com/seligj95/app-service-self-healing-agent-python
    cd app-service-self-healing-agent-python
    azd auth login
    azd up

You'll have an agent + AOAI + workbook + healer running in about 8 minutes. Then run the chaos script and watch the slot flip.
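As a rough sketch of how the vendored pieces might compose around a single tool call, the following wires a budget check and a repair retry together. It is self-contained and does not use the repo's code: `Budget`, `MalformedArgs`, `flaky_tool`, `repair_args`, and `run_tool` are hypothetical stand-ins, and the model-driven repair step is replaced by a plain function so the example runs without any model access.

```python
import asyncio
from enum import Enum


class BudgetDecision(Enum):
    ALLOW = "allow"
    DOWNSHIFT = "downshift"
    BLOCK = "block"


class Budget:
    """In-memory per-tenant spend tracker (hypothetical stand-in for budget.py)."""

    def __init__(self, limit_usd: float):
        self.limit_usd = limit_usd
        self.spent = {}

    def evaluate(self, tenant: str) -> BudgetDecision:
        spent = self.spent.get(tenant, 0.0)
        if spent >= self.limit_usd:
            return BudgetDecision.BLOCK
        if spent >= self.limit_usd * 0.80:
            return BudgetDecision.DOWNSHIFT
        return BudgetDecision.ALLOW


class MalformedArgs(Exception):
    """Raised when tool arguments fail validation."""


async def flaky_tool(args: dict) -> str:
    # Stand-in tool: rejects arguments that lack a "city" key.
    if "city" not in args:
        raise MalformedArgs("missing required field: city")
    return f"forecast for {args['city']}"


async def repair_args(args: dict, error: str) -> dict:
    # Stand-in for re-prompting the model with the validation error.
    return {**args, "city": args.get("location", "unknown")}


async def run_tool(budget: Budget, tenant: str, args: dict, max_attempts: int = 2) -> dict:
    # Check the budget before doing any work, so a blocked tenant costs nothing.
    decision = budget.evaluate(tenant)
    if decision is BudgetDecision.BLOCK:
        return {"status": "blocked", "result": None, "repairs": 0}
    repairs = 0
    last_exc = None
    for _ in range(max_attempts):
        try:
            result = await flaky_tool(args)
            return {"status": decision.value, "result": result, "repairs": repairs}
        except MalformedArgs as exc:
            last_exc = exc
            repairs += 1
            args = await repair_args(args, str(exc))
    raise last_exc


budget = Budget(limit_usd=10.0)
budget.spent["acme"] = 8.5      # 85% of budget: expect a downshift
budget.spent["globex"] = 10.0   # at the cap: expect a block
ok = asyncio.run(run_tool(budget, "acme", {"location": "Oslo"}))
blocked = asyncio.run(run_tool(budget, "globex", {"location": "Oslo"}))
```

Here the first tenant gets an answer after one repair while flagged for the cheaper model, and the second is refused before any tool call is made. Ordering the budget check first is one reasonable design choice; the sample itself may sequence these steps differently.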
The KQL workbook

Deployable workbook JSON, dropped into the resource group by Bicep. Six panels:

- SLO tile: % of tasks where agent.task.success was emitted (grouped by tenant)
- Cost burn-down: running spend per tenant against the monthly budget
- Top failing tools: failure count by tool, broken down by error class
- Latency p50/p95/p99: agent.task.latency histogram
- Budget breaches: count and tenant list
- Healing signals: agent.repair.retry + agent.model.downshift + agent.chaos.injected over time

It's observability/workbook.json, pulled into infra/shared/monitoring.bicep with loadTextContent, so you get it deployed automatically.

Why App Service for LLMOps

After building this, the appeal of App Service for agents is clearer than I expected going in:

- Slots are an unfair advantage. A pre-warmed previous version, one ARM call from production. K8s blue/green needs you to build it.
- Managed identity to Azure OpenAI removes the entire key-rotation problem. The sample sets disableLocalAuth: true on the AOAI account, so there literally is no key.
- App Insights is auto-wired so your custom metrics land in customMetrics and your KQL queries work day one.
- Bicep + azd lets you ship a full LLMOps stack in one repo: app, infra, healing, observability, chaos.

If you're standing up a new agent and you don't already have a Kubernetes platform you love, App Service is a strong default.

Wrap-up

If you take three things from this post:

1. Define agent SLOs in your own terms (task success, cost per task, tool reliability), not just web-app SLOs.
2. Put a circuit breaker between the user and the model. A budget breaker that downshifts to a cheaper model is the highest-ROI middleware you can ship.
3. Make rollback boring. Slot swap + a one-action Logic App + a metric alert is a self-healing system you can build in an afternoon and trust at 3 AM.

The sample has all of it wired up.
We're considering baking these into App Service: tell us what you'd want

The middleware in this sample (SLIs + telemetry, cost guardrails, policy/audit hooks) is exactly the kind of thing we're evaluating as first-class App Service platform features: opt-in sidecars or built-in capabilities so you don't have to vendor a middleware package into every agent you ship. Concretely, we're tracking ideas like:

- Agent Observatory: a sidecar that intercepts SDK calls (Semantic Kernel, LangChain, Crew AI, AutoGen) and captures full reasoning traces with zero code changes
- AI Cost Guardian: platform-level quotas and spend caps across Azure OpenAI, Anthropic, and other model providers, with real-time enforcement
- Policy Guard: governance, PII masking, model-approval lists, and an immutable audit log for regulated workloads

If any of those would land for your team, or if you're solving these problems differently and want to push back on the shape, we want to hear it. Drop a comment on this post: the roadmap is genuinely shaped by feedback at this stage.
jordanselig
May 18, 2026 | Apps on Azure Blog
New SSH helper aliases for Python apps on Azure App Service for Linux
Troubleshooting a running application often starts with SSH. To make that experience simpler for Python apps on Azure App Service for Linux, we have added new SSH helper aliases for common app, log, networking, and Azure AI Foundry diagnostics.

When you SSH into your application, you will now see two helper commands:

View available SSH helpers with apphelp

Run apphelp to see the full list of available aliases. These helpers are grouped by common tasks, including app information, logs, diagnostics, testing, and Azure AI Foundry connectivity. For example:

- applogs: Tails your application logs directly from the SSH session.
- appcurl: Tests your application locally using localhost:$PORT, which is useful when checking whether the app is listening correctly inside the container.

Other useful helpers include:

    showpkgs     # List installed Python packages
    appconfig    # Show common App Service settings
    deploylogs   # Show recent deployment logs
    checkport    # Verify the app is listening on the configured port
    gohome       # Go to /home/site/wwwroot
    gosrc        # Go to the app source directory

Azure AI Foundry diagnostics from SSH

We have also added helpers for Azure AI Foundry scenarios. These are useful when your app calls Azure AI services and you need to quickly validate identity, DNS, connectivity, or response latency from inside the App Service environment.
For a quick end-to-end connectivity test, run:

    ai-test

Example output:

    ✓ Connected | 3706ms | Model: gpt-4.1-mini | Auth: Managed Identity

For a broader diagnostic check, run:

    ai-diagnose

Example output:

    AI Foundry Diagnostics
    [✓] Managed identity token
    [✓] DNS resolution
    [✓] Foundry connectivity

Additional AI helpers include:

    ai-dns            # Check DNS resolution for the Foundry endpoint
    ai-access-check   # Check RBAC for Foundry calls
    ai-curl           # Verbose HTTP debug for Foundry
    ai-latency        # Benchmark Foundry response times

Install networking tools with install-nettools

For deeper connectivity troubleshooting, you can run:

    install-nettools

This installs commonly used networking utilities that can help diagnose DNS resolution, TCP connectivity, routing, packet capture, listening ports, and HTTP endpoint access.

Why this helps

These aliases are intended to reduce the number of manual steps needed during troubleshooting. Instead of remembering log paths, port checks, curl commands, or AI connectivity validation steps, you can run a focused helper command directly from the SSH session.

If there are other common SSH commands or troubleshooting workflows you would like us to add as aliases, please share your feedback with us.
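For readers who want to script a similar check from inside their own app, the idea behind a local port check can be sketched with the Python standard library. This is an illustration only and says nothing about how the checkport or appcurl helpers are actually implemented: `is_listening` is a hypothetical function, and reading `PORT` from the environment with a fallback of 8000 is an assumption.

```python
import os
import socket


def is_listening(port: int, host: str = "127.0.0.1", timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    port = int(os.environ.get("PORT", "8000"))  # assumed default for local runs
    state = "listening" if is_listening(port) else "not listening"
    print(f"localhost:{port} is {state}")
```

A TCP connect only shows that a process is bound to the port. It does not confirm that the app returns a healthy HTTP response, which is the part an HTTP-level test such as appcurl would cover.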
TulikaC
May 15, 2026 | Apps on Azure Blog