Microsoft Foundry SDK
Fine-tuning at Ignite 2025: new models, new tools, new experience
Fine-tuning isn't just "better prompts." It's how you tailor a foundation model to your domain and tasks to get higher accuracy, lower cost, and faster responses, then run it at scale. As agents become more critical to businesses, we're seeing growing demand for fine-tuning to ensure agents are low latency, low cost, and call the right tools at the right time. At Ignite 2025, we saw how Docusign fine-tuned the models that power their document management system to achieve major gains: more than 50% cost reduction per document, 2x faster inference time, and significant improvements in accuracy.

At Ignite, we launched several new features in Microsoft Foundry that make fine-tuning easier, more scalable, and more impactful than ever, with the goal of making agents unstoppable in the real world:

- New open-source models – Qwen3 32B, Ministral 3B, GPT-OSS-20B, and Llama 3.3 70B – giving users access to open-source models in the same low-friction experience as OpenAI models
- Synthetic data generation to jump-start your training journey – just upload your documents and our multi-agent system takes care of the rest
- Developer Training tier to reduce the barrier to entry by offering discounted training (50% off global!) on spot capacity
- Agentic Reinforcement Fine-Tuning with GPT-5 – leverage tool calling during chain of thought to teach reasoning models to use your tools to solve complex problems

And if that wasn't enough, we also released a re-imagined fine-tuning experience in Foundry (new), providing access to all these capabilities in a simplified and unified UI.

New Open-Source Models for Fine-tuning (Public Preview): Bringing open-source innovation to your fingertips

We've expanded our model lineup with new open-source models you can fine-tune without worrying about GPUs or compute. Ministral 3B and Qwen3 32B are now available to fine-tune with Supervised Fine-Tuning (SFT) in Microsoft Foundry, enabling developers to adapt open-source models to their enterprise-specific domains with ease. Look out for Llama 3.3 70B and GPT-OSS-20B, coming next week!

These OSS models are offered through a unified interface with OpenAI models via the UI or the Foundry SDK, which means the same experience regardless of model choice (a minimal sketch of submitting a training job through this interface appears at the end of this post). These models can be used alongside your favorite Foundry tools, from AI Search to Evaluations, or to power your agents. Note: new OSS models are only available in "new" Foundry – so upgrade today!

Like our OpenAI models, open-source models in Foundry charge per token for training, making it simple to forecast and estimate your costs. All models are available on the Global Standard tier, making discoverability easy. For more details on pricing, please see our Microsoft Foundry Models pricing page.

Customers like CoStar Group have already seen success leveraging fine-tuning with Mistral models to power their home search experience on Homes.com. They selected Ministral 3B as a small, efficient model to power high-volume, low-latency processing with lower costs and faster deployment times than frontier models – while still meeting their needs for accuracy, scalability, and availability thanks to fine-tuning in Foundry.

Synthetic data generation (Public Preview): Create high-quality training data automatically

Developers can now generate high-quality, domain-specific synthetic datasets to close those persistent data gaps with synthetic data generation.
One of the biggest challenges we hear teams face during fine-tuning is not having enough data, or the right kind of data, because it's scarce, sensitive, or locked behind compliance constraints (think healthcare and finance). Our new synthetic data generation capability solves this by giving you a safe, scalable way to create realistic, diverse datasets tailored to your use case, so you can fine-tune and evaluate models without waiting for perfect real-world data. Now you can produce realistic question–answer pairs from your documents, or simulate multi-turn tool-use dialogues that include function calls, without touching sensitive production data.

How it works:

- Fine-tuning datasets: Upload a reference file (PDF/Markdown/TXT) and Foundry converts it into SFT-formatted Q&A pairs that reflect your domain's language and nuances, so your model learns from the right examples.
- Agent tool-use datasets: Provide an OpenAPI (Swagger) spec, and Foundry simulates multi-turn assistant–user conversations with tool calls, producing SFT-ready examples that teach models to call your APIs reliably.
- Evaluation datasets: Generate distinct test queries tailored to your scenarios so you can measure model and agent quality objectively – separate from your training data to avoid false confidence.

Agents succeed when they reliably understand domain intent and call the right tools at the right time. Foundry's synthetic data generation does exactly that: it creates task-specific training and test data so your agent learns from the right examples, and it lets you prove the agent works before you go live, so it stays reliable in the real world.

Developer Training Tier (Public Preview): 50% discount on training jobs

Fine-tuning can be expensive, especially when you may need to run multiple experiments to create the right model for your production agents. To make it easier than ever to get started, we're introducing the Developer Training tier, giving users a 50% discount when they choose to run workloads on pre-emptible capacity. It also lets users iterate faster: we support up to 10 concurrent jobs on the Developer tier, making it ideal for running experiments in parallel. Because it uses reclaimable capacity, jobs may be pre-empted and automatically resumed, so they may take longer to complete.

When to use the Developer Training tier:

- When cost matters – great for early experimentation or hyperparameter tuning thanks to 50% lower training cost.
- When you need high concurrency – supports up to 10 simultaneous jobs, ideal for running multiple experiments in parallel.
- When the workload is non-urgent – suitable for jobs that can tolerate pre-emption and longer, capacity-dependent runtimes.

Agentic Reinforcement Fine-Tuning (RFT) (Private Preview): Train reasoning models to use your tools through outcome-based optimization

Building reliable AI agents requires more than copying correct behavior; models need to learn which reasoning paths lead to successful outcomes. While supervised fine-tuning trains models to imitate demonstrations, reinforcement fine-tuning optimizes models based on whether their chain of thought actually produces a successful outcome. It teaches them to think in new ways, about new domains, to solve complex problems. Agentic RFT applies this to tool-using workflows: the model generates multiple reasoning traces (including tool calls and planning steps), receives feedback on which attempts solved the problem correctly, and updates its reasoning patterns accordingly.
This helps models learn effective strategies for tool sequencing, error recovery, and multi-step planning – behaviors that are difficult to capture through demonstrations alone. The difference now is that you can provide your own custom tools for use during chain of thought: models can interact with your internal systems, retrieve the data they need, and access your proprietary APIs to solve your unique problems. Agentic RFT is currently available in private preview for o4-mini and GPT-5, with configurable reasoning effort, sampling rates, and per-run telemetry. Request access at aka.ms/agentic-rft-preview.

What are customers saying?

Fine-tuning is critical to achieving the accuracy and latency needed for enterprise agentic workloads. Decagon is used by many of the world's most respected enterprises to build, manage, and scale AI agents that can resolve millions of customer inquiries across chat, email, and voice – 24 hours a day, seven days a week. This experience is powered by fine-tuning:

"Providing accurate responses with minimal latency is fundamental to Decagon's product experience. We saw an opportunity to reduce latency while improving task-specific accuracy by fine-tuning models using our proprietary datasets. Via fine-tuning, we were able to exceed the performance of larger state-of-the-art models with smaller, lighter-weight models which could be served significantly faster." – Cyrus Asgari, Lead Research Engineer for fine-tuning at Decagon

But it's not just agent-first startups seeing results. Companies like Discovery Bank are using fine-tuned models to provide better customer experiences with personal banking agents:

"We consolidated three steps into one; response times that were previously five or six seconds came down to one and a half to two seconds on average. This approach made the system more efficient, and the 50% reduction in latency made conversations with Discovery AI feel seamless." – Stuart Emslie, Head of Actuarial and Data Science at Discovery Bank

Fine-tuning has evolved from an optimization technique to essential infrastructure for production AI. Whether building specialized agents or enhancing existing products, the pattern is clear: custom-trained models deliver the accuracy and speed that general-purpose models can't match. As techniques like Agentic RFT and synthetic data generation mature, the question isn't whether to fine-tune, but how to build the systems to do it systematically.

Learn More

🧠 Get Started with fine-tuning with Azure AI Foundry on Microsoft Learn Docs
▶️ Watch On-Demand: https://ignite.microsoft.com/en-US/sessions/BRK188?source=sessions
👩 Try the demos: aka.ms/FT-ignite-demos
👋 Continue the conversation on Discord
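To make the unified SDK experience mentioned above concrete, here is a minimal sketch of preparing one chat-formatted SFT training example and submitting a fine-tuning job through the OpenAI-compatible Python SDK. This is an illustration under stated assumptions, not a confirmed recipe: the endpoint, API version, file name, and model identifier are placeholders, and you should use the exact values shown in your Foundry project.

# Minimal sketch (assumptions noted above): prepare SFT data and submit a fine-tuning
# job through the OpenAI-compatible SDK. Replace all placeholder values with your own.
import json
from openai import AzureOpenAI

# One supervised fine-tuning example in the chat "messages" format.
example = {
    "messages": [
        {"role": "system", "content": "You answer questions about our internal HR policy."},
        {"role": "user", "content": "How many days of parental leave do we offer?"},
        {"role": "assistant", "content": "Full-time employees receive 16 weeks of paid parental leave."},
    ]
}

# Training data is uploaded as JSONL: one JSON object per line.
with open("training_data.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps(example) + "\n")

client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",  # placeholder
    api_key="<your-api-key>",                                    # placeholder
    api_version="2024-10-21",                                    # placeholder API version
)

# Upload the dataset, then create the supervised fine-tuning job.
training_file = client.files.create(file=open("training_data.jsonl", "rb"), purpose="fine-tune")
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="Ministral-3B",  # illustrative; use the exact base-model ID shown in Foundry
)
print(job.id, job.status)

The call shape stays the same whether the base model is an OpenAI model or one of the open-source models above; only the model identifier changes.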
Publishing Agents from Microsoft Foundry to Microsoft 365 Copilot & Teams

Better Together is a series on how Microsoft's AI platforms work seamlessly to build, deploy, and manage intelligent agents at enterprise scale. As organizations embrace AI across every workflow, Microsoft Foundry, Microsoft 365, Agent 365, and Microsoft Copilot Studio are coming together to deliver a unified approach – from development to deployment to day-to-day operations. This three-part series explores how these technologies connect to help enterprises build AI agents that are secure, governed, and deeply integrated with Microsoft's product ecosystem.

Series Overview

- Part 1: Publishing from Foundry to Microsoft 365 Copilot and Microsoft Teams
- Part 2: Foundry + Agent 365 – Native Integration for Enterprise AI
- Part 3: Microsoft Copilot Studio Integration with Foundry Agents

This blog focuses on Part 1: Publishing from Foundry to Microsoft 365 Copilot – how developers can now publish agents built in Foundry directly to Microsoft 365 Copilot and Teams in just a few clicks.

Build once. Publish everywhere.

Developers can now take an AI agent built in Microsoft Foundry and publish it directly to Microsoft 365 Copilot and Microsoft Teams in just a few clicks. The new streamlined publishing flow eliminates manual setup across Entra ID, Azure Bot Service, and manifest files, turning hours of configuration into a seamless, guided flow in the Foundry Playground.

Simplifying Agent Publishing for Microsoft 365 Copilot & Microsoft Teams

Previously, deploying a Foundry AI agent into Microsoft 365 Copilot and Microsoft Teams required multiple steps: app registration, bot provisioning, manifest editing, and admin approval. With the new Foundry → M365 integration, the process is straightforward and intuitive.

Key capabilities

- No-code publishing – Prepare, package, and publish agents directly from the Foundry Playground.
- Unified build – A single agent package powers multiple Microsoft 365 channels, including Teams Chat, Microsoft 365 Copilot Chat, and BizChat.
- Agent-type agnostic – Works seamlessly whether you have a prompt agent, hosted agent, or workflow agent.
- Built-in governance – Every agent published to your organization is automatically routed through the Microsoft 365 Admin Center (MAC) for review, approval, and monitoring.
- Downloadable package – Developers can download a .zip for local testing or submission to the Microsoft Marketplace.

For pro-code developers, the experience is also simplified. A C# code-first sample in the Agent Toolkit for Visual Studio is searchable, featured, and ready to use.

Why It Matters

This integration isn't just about convenience; it's about scale, control, and trust.

- Faster time to value – Deliver intelligent agents where people already work, without infrastructure overhead.
- Enterprise control – Admins retain full oversight via the Microsoft 365 Admin Center, with built-in approval, review, and governance flows.
- Developer flexibility – Both low-code creators and pro-code developers benefit from the unified publishing experience.
- Better Together – This capability lays the groundwork for Agent 365 publishing and deeper M365 integrations.

Real-world scenarios

YoungWilliams built Priya, an AI agent that helps handle government service inquiries faster and more efficiently. Using the one-click publishing flow, Priya was quickly deployed to Microsoft Teams and M365 Copilot without manual setup. This allowed YoungWilliams' customers to provide faster, more accurate responses while keeping governance and compliance intact.
"Integrating Microsoft Foundry with Microsoft 365 Copilot fundamentally changed how we deliver AI solutions to our government partners," said John Tidwell, CTO of YoungWilliams. "With Foundry's one-click publishing to Teams and Copilot, we can take an idea from prototype to production in days instead of weeks – while maintaining the enterprise-grade security and governance our clients expect. It's a game changer for how public services can adopt AI responsibly and at scale."

Availability

Publishing from Foundry to M365 is in Public Preview within the Foundry Playground. Developers can explore the preview in Microsoft Foundry and test the Teams / M365 publishing flow today. SDK and CLI extensions for code-first publishing are generally available.

What's Next in the Better Together Series

This blog is part of the broader Better Together series connecting Microsoft Foundry, Microsoft 365, Agent 365, and Microsoft Copilot Studio. Continue the journey: Foundry + Agent 365 – Native Integration for Enterprise AI (Link)

Start building today

[Quickstart – Publish an Agent to Microsoft 365] Try it now in the new Foundry Playground
Hybrid AI Using Foundry Local, Microsoft Foundry and the Agent Framework - Part 1

Hybrid AI is quickly becoming one of the most practical architectures for real-world applications – especially when privacy, compliance, or sensitive data handling matter. Today, it's increasingly common for users to have capable GPUs in their laptops or desktops, and the ecosystem of small, efficient open-source language models has grown dramatically. That makes local inference not only possible, but easy. In this guide, we explore how a locally run agent built with the Agent Framework can combine the strengths of cloud models in Azure AI Foundry with a local LLM running on your own GPU through Foundry Local. This pattern allows you to use powerful cloud reasoning without ever sending raw sensitive data – like medical labs, legal documents, or financial statements – off the device. Part 1 focuses on the foundations of this architecture, using a simple illustrative example to show how local and cloud inference can work together seamlessly under a single agent.

Disclaimer: The diagnostic results, symptom checker, and any medical guidance provided in this article are for illustrative and informational purposes only. They are not intended to provide medical advice, diagnosis, or treatment.

Demonstrating the concept

Problem Statement

We've all done it: something feels off, we get a strange symptom, or a lab report pops into our inbox – and before thinking twice, we copy-paste way too much personal information into whatever website or chatbot seems helpful at the moment. Names, dates of birth, addresses, lab values, clinic details… all shared out of habit, usually because we just want answers quickly. This guide uses a simple, illustrative scenario – a symptom checker with lab report summarization – to show how hybrid AI can help reduce that oversharing. It's not a medical product or a clinical solution, but it's a great way to understand the pattern. With Microsoft Foundry, Foundry Local, and the Agent Framework, we can build workflows where sensitive data stays on the user's machine and is processed locally, while the cloud handles the heavier reasoning. Only a safe, structured summary ever leaves the device. The Agent Framework handles when to use the local model vs. the cloud model, giving us a seamless and privacy-preserving hybrid experience.

Demo scenario

This demo uses a simple, illustrative symptom checker to show how hybrid AI keeps sensitive data private while still benefiting from powerful cloud reasoning. It's not a medical product – just an easy way to demonstrate the pattern. Here's what happens:

- A Python agent (Agent Framework) runs locally and can call both cloud models and local tools.
- Azure AI Foundry (GPT-4o) handles reasoning and triage logic but never sees raw PHI.
- Foundry Local runs a small LLM (phi-4-mini) on your GPU and processes the raw lab report entirely on-device.
- A tool function (@ai_function) lets the agent call the local model automatically when it detects lab-like text.

The flow is simple:

1. user_message = symptoms + raw lab text
2. agent → calls local tool → local LLM returns JSON
3. cloud LLM → uses JSON to produce guidance

Environment setup

Foundry Local Service

On the local machine with a GPU, let's install Foundry Local using:

PS C:\Windows\system32> winget install Microsoft.FoundryLocal

Then let's download our local model, in this case phi-4-mini, and test it:

PS C:\Windows\system32> foundry model download phi-4-mini
Downloading Phi-4-mini-instruct-cuda-gpu:5...
[################### ] 53.59 % [Time remaining: about 4m] 5.9 MB/s

PS C:\Windows\system32> foundry model load phi-4-mini
🕗 Loading model...
🟢 Model phi-4-mini loaded successfully

PS C:\Windows\system32> foundry model run phi-4-mini
Model Phi-4-mini-instruct-cuda-gpu:5 was found in the local cache.
Interactive Chat. Enter /? or /help for help. Press Ctrl+C to cancel generation. Type /exit to leave the chat.
Interactive mode, please enter your prompt
> Hello can you let me know who you are and which model you are using
🧠 Thinking...
🤖 Hello! I'm Phi, an AI developed by Microsoft. I'm here to help you with any questions or tasks you have. How can I assist you today?
>

PS C:\Windows\system32> foundry service status
🟢 Model management service is running on http://127.0.0.1:52403/openai/status

Now we can see that the model is accessible over an API on localhost, port 52403. Foundry Local models don't always use simple names like "phi-4-mini": each installed model has a specific Model ID that Foundry Local assigns (for example, Phi-4-mini-instruct-cuda-gpu:5 in this case). We can now use that Model ID for a quick test:

from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:52403/v1", api_key="ignored")
resp = client.chat.completions.create(
    model="Phi-4-mini-instruct-cuda-gpu:5",
    messages=[{"role": "user", "content": "Say hello"}],
)
print(resp.choices[0].message.content)  # quick sanity check that the local endpoint responds

The request returned 200 OK.

Microsoft Foundry

To handle the cloud part of the hybrid workflow, we start by creating a Microsoft AI Foundry project. This gives us an easy, managed way to use models like GPT-4o-mini – no deployment steps, no servers to configure. You simply point the Agent Framework at your project, authenticate, and you're ready to call the model. A nice benefit is that Microsoft Foundry and Foundry Local share the same style of API. Whether you call a model in the cloud or on your own machine, the request looks almost identical. This consistency makes hybrid development much easier: the agent doesn't need different logic for local vs. cloud models – it just switches between them when needed.

Under the Hood of Our Hybrid AI Workflow

Agent Framework

For the agent code, I am using the Agent Framework libraries, and I am giving specific instructions to the agent as per below:

from agent_framework import ChatAgent, ai_function
from agent_framework.azure import AzureAIAgentClient
from azure.identity.aio import AzureCliCredential

# Additional imports used by the tool and main() further below (added here for completeness)
import asyncio
import json
import requests
from typing import Annotated, Any, Dict
from pydantic import Field

# ========= Cloud Symptom Checker Instructions =========
SYMPTOM_CHECKER_INSTRUCTIONS = """
You are a careful symptom-checker assistant for non-emergency triage.

General behavior:
- You are NOT a clinician. Do NOT provide medical diagnosis or prescribe treatment.
- First, check for red-flag symptoms (e.g., chest pain, trouble breathing, severe bleeding, stroke signs, one-sided weakness, confusion, fainting). If any are present, advise urgent/emergency care and STOP.
- If no red-flags, summarize key factors (age group, duration, severity), then provide:
  1) sensible next steps a layperson could take,
  2) clear guidance on when to contact a clinician,
  3) simple self-care advice if appropriate.
- Use plain language, under 8 bullets total.
- Always end with: "This is not medical advice."

Tool usage:
- When the user provides raw lab report text, or mentions "labs below" or "see labs", you MUST call the `summarize_lab_report` tool to convert the labs into structured data before giving your triage guidance.
- Use the tool result as context, but do NOT expose the raw JSON directly. Instead, summarize the key abnormal findings in plain language.
""".strip() Referencing the local model Now I am providing a system prompt for the locally inferred model to transform the lab result text into a JSON object with lab results only: # ========= Local Lab Summarizer (Foundry Local + Phi-4-mini) ========= FOUNDRY_LOCAL_BASE = "http://127.0.0.1:52403" # from `foundry service status` FOUNDRY_LOCAL_CHAT_URL = FOUNDRY_LOCAL_BASE + "/v1/chat/completions" # This is the model id you confirmed works: FOUNDRY_LOCAL_MODEL_ID = "Phi-4-mini-instruct-cuda-gpu:5" LOCAL_LAB_SYSTEM_PROMPT = """ You are a medical lab report summarizer running locally on the user's machine. You MUST respond with ONLY one valid JSON object. Do not include any explanation, backticks, markdown, or text outside the JSON. The JSON must have this shape: { "overall_assessment": "<short plain English summary>", "notable_abnormal_results": [ { "test": "string", "value": "string", "unit": "string or null", "reference_range": "string or null", "severity": "mild|moderate|severe" } ] } If you are unsure about a field, use null. Do NOT invent values. """.strip() Agent Framework tool In this next step, we wrap the local Foundry inference inside an Agent Framework tool using the AI_function decorator. This abstraction is more than styler—it is the recommended best practice for hybrid architectures. By exposing local GPU inference as a tool, the cloud-hosted agent can decide when to call it, pass structured arguments, and consume the returned JSON seamlessly. It also ensures that the raw lab text (which may contain PII) stays strictly within the local function boundary, never entering the cloud conversation. Using a tool in this way provides a consistent, declarative interface, enables automatic reasoning and tool-routing by frontier models, and keeps the entire hybrid workflow maintainable, testable, and secure: @ai_function( name="summarize_lab_report", description=( "Summarize a raw lab report into structured abnormalities using a local model " "running on the user's GPU. Use this whenever the user provides lab results as text." ), ) def summarize_lab_report( lab_text: Annotated[str, Field(description="The raw text of the lab report to summarize.")], ) -> Dict[str, Any]: """ Tool: summarize a lab report using Foundry Local (Phi-4-mini) on the user's GPU. 
    Returns a JSON-compatible dict with:
    - overall_assessment: short text summary
    - notable_abnormal_results: list of abnormal test objects
    """
    payload = {
        "model": FOUNDRY_LOCAL_MODEL_ID,
        "messages": [
            {"role": "system", "content": LOCAL_LAB_SYSTEM_PROMPT},
            {"role": "user", "content": lab_text},
        ],
        "max_tokens": 256,
        "temperature": 0.2,
    }
    headers = {
        "Content-Type": "application/json",
    }

    print(f"[LOCAL TOOL] POST {FOUNDRY_LOCAL_CHAT_URL}")
    resp = requests.post(
        FOUNDRY_LOCAL_CHAT_URL,
        headers=headers,
        data=json.dumps(payload),
        timeout=120,
    )
    resp.raise_for_status()
    data = resp.json()

    # OpenAI-compatible shape: choices[0].message.content
    content = data["choices"][0]["message"]["content"]

    # Handle string vs list-of-parts
    if isinstance(content, list):
        content_text = "".join(
            part.get("text", "") for part in content if isinstance(part, dict)
        )
    else:
        content_text = content

    print("[LOCAL TOOL] Raw content from model:")
    print(content_text)

    # Strip ```json fences if present, then parse JSON
    cleaned = _strip_code_fences(content_text)
    lab_summary = json.loads(cleaned)

    print("[LOCAL TOOL] Parsed lab summary JSON:")
    print(json.dumps(lab_summary, indent=2))

    # Return dict – Agent Framework will serialize this as the tool result
    return lab_summary

The case, labs and prompt

All patient and provider information in the example below is entirely fictitious and used for illustrative purposes only. To illustrate the pattern, this sample prepares the "case" in code: it combines a symptom description with a lab report string and then submits that prompt to the agent. In production, these inputs would be captured from a UI or API.

# Example free-text case + raw lab text that the agent can decide to send to the tool
case = (
    "Teenager with bad headache and throwing up. Fever of 40C and no other symptoms."
)

lab_report_text = """
-------------------------------------------
AI Land FAMILY LABORATORY SERVICES
4420 Camino Del Foundry, Suite 210
Gpuville, CA 92108
Phone: (123) 555-4821 | Fax: (123) 555-4822
-------------------------------------------
PATIENT INFORMATION
Name: Frontier Model
DOB: 04/12/2007 (17 yrs)
Sex: Male
Patient ID: AXT-442871
Address: 1921 MCP Court, CA 01100

ORDERING PROVIDER
Dr. Bot, MD
NPI: 1780952216
Clinic: Phi Pediatrics Group

REPORT DETAILS
Accession #: 24-SDFLS-118392
Collected: 11/14/2025 14:32
Received: 11/14/2025 16:06
Reported: 11/14/2025 20:54
Specimen: Whole Blood (EDTA), Serum Separator Tube

------------------------------------------------------
COMPLETE BLOOD COUNT (CBC)
------------------------------------------------------
WBC ................. 14.5 x10^3/µL (4.0 – 10.0) HIGH
RBC ................. 4.61 x10^6/µL (4.50 – 5.90)
Hemoglobin .......... 13.2 g/dL (13.0 – 17.5) LOW-NORMAL
Hematocrit .......... 39.8 % (40.0 – 52.0) LOW
MCV ................. 86.4 fL (80 – 100)
Platelets ........... 210 x10^3/µL (150 – 400)

------------------------------------------------------
INFLAMMATORY MARKERS
------------------------------------------------------
C-Reactive Protein (CRP) ......... 60 mg/L (< 5 mg/L) HIGH
Erythrocyte Sedimentation Rate ... 32 mm/hr (0 – 15 mm/hr) HIGH

------------------------------------------------------
BASIC METABOLIC PANEL (BMP)
------------------------------------------------------
Sodium (Na) .............. 138 mmol/L (135 – 145)
Potassium (K) ............ 3.9 mmol/L (3.5 – 5.1)
Chloride (Cl) ............ 102 mmol/L (98 – 107)
CO2 (Bicarbonate) ........ 23 mmol/L (22 – 29)
Blood Urea Nitrogen (BUN)  11 mg/dL (7 – 20)
Creatinine ................ 0.74 mg/dL (0.50 – 1.00)
Glucose (fasting) ......... 109 mg/dL (70 – 99) HIGH

------------------------------------------------------
LIVER FUNCTION TESTS
------------------------------------------------------
AST ....................... 28 U/L (0 – 40)
ALT ....................... 22 U/L (0 – 44)
Alkaline Phosphatase ...... 144 U/L (65 – 260)
Total Bilirubin ........... 0.6 mg/dL (0.1 – 1.2)

------------------------------------------------------
NOTES
------------------------------------------------------
Mild leukocytosis and elevated inflammatory markers (CRP, ESR) may indicate an
acute infectious or inflammatory process. Glucose slightly elevated; could be
non-fasting.

------------------------------------------------------
END OF REPORT
SDFLS-CLIA ID: 05D5554973
This report is for informational purposes only and not a diagnosis.
------------------------------------------------------
"""

# Single user message that gives both the case and labs.
# The agent will see that there are labs and call summarize_lab_report() as a tool.
user_message = (
    "Patient case:\n"
    f"{case}\n\n"
    "Here are the lab results as raw text. If helpful, you can summarize them first:\n"
    f"{lab_report_text}\n\n"
    "Please provide non-emergency triage guidance."
)

The Hybrid Agent code

Here's where the hybrid behavior actually comes together. By this point, we've defined a local tool that talks to Foundry Local and configured access to a cloud model in Azure AI Foundry. In the main() function, the Agent Framework ties these pieces into a single workflow. The agent runs locally, receives a message containing both symptoms and a raw lab report, and decides when to call the local tool. The lab report is summarized on your GPU, and only the structured JSON is passed to the cloud model for reasoning. The snippet below shows how we attach the tool to the agent and trigger both local inference and cloud guidance within one natural-language prompt.

# ========= Hybrid Main (Agent uses the local tool) =========
async def main():
    ...
    async with (
        AzureCliCredential() as credential,
        ChatAgent(
            chat_client=AzureAIAgentClient(async_credential=credential),
            instructions=SYMPTOM_CHECKER_INSTRUCTIONS,
            # 👇 Tool is now attached to the agent
            tools=[summarize_lab_report],
            name="hybrid-symptom-checker",
        ) as agent,
    ):
        result = await agent.run(user_message)
        print("\n=== Symptom Checker (Hybrid: Local Tool + Cloud Agent) ===\n")
        print(result.text)

if __name__ == "__main__":
    asyncio.run(main())

Testing the Hybrid Agent

Now I am running the agent code from VS Code, and I can see the local inference happening when the lab report is submitted. The results are then formatted with PII omitted, and the GPT-4o model processes the symptoms alongside the structured lab results.

What's next

In this example, the agent runs locally and pulls in both cloud and local inference. In Part 2, we'll explore the opposite architecture: a cloud-hosted agent that can safely call back into a local LLM through a secure gateway. This opens the door to more advanced hybrid patterns where tools running on edge devices, desktops, or on-prem systems can participate in cloud-driven workflows without exposing sensitive data.

References

Agent Framework: https://github.com/microsoft/agent-framework
Repo for the code available here:
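One note for completeness: the summarize_lab_report tool above calls a small _strip_code_fences helper that is not shown in the listing. A minimal sketch of what such a helper might look like follows; this is an assumption about its behavior (removing markdown code fences before parsing the JSON), not the author's exact implementation.

# Hypothetical sketch of the _strip_code_fences helper referenced in the tool above.
import re

def _strip_code_fences(text: str) -> str:
    """Remove a leading/trailing ``` or ```json fence if the model wrapped its JSON in one."""
    cleaned = text.strip()
    # Drop an opening fence such as ``` or ```json (with an optional language tag)
    cleaned = re.sub(r"^```[a-zA-Z0-9_-]*\s*", "", cleaned)
    # Drop a closing fence at the end of the string
    cleaned = re.sub(r"\s*```$", "", cleaned)
    return cleaned.strip()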