microsoft foundry
21 TopicsBeyond the Model: Empower your AI with Data Grounding and Model Training
Discover how Microsoft Foundry goes beyond foundational models to deliver enterprise-grade AI solutions. Learn how data grounding, model tuning, and agentic orchestration unlock faster time-to-value, improved accuracy, and scalable workflows across industries.461Views5likes3CommentsOpen AI’s GPT-5.1-codex-max in Microsoft Foundry: Igniting a New Era for Enterprise Developers
Announcing GPT-5.1-codex-max: The Future of Enterprise Coding Starts Now We’re thrilled to announce the general availability of OpenAI's GPT-5.1-codex-max in Microsoft Foundry Models; a leap forward that redefines what’s possible for enterprise-grade coding agents. This isn’t just another model release; it’s a celebration of innovation, partnership, and the relentless pursuit of developer empowerment. At Microsoft Ignite, we unveiled Microsoft Foundry: a unified platform where businesses can confidently choose the right model for every job, backed by enterprise-grade reliability. Foundry brings together the best from OpenAI, Anthropic, xAI, Black Forest Labs, Cohere, Meta, Mistral, and Microsoft’s own breakthroughs, all under one roof. Our partnership with Anthropic is a testament to our commitment to giving developers access to the most advanced, safe, and high-performing models in the industry. And now, with GPT-5.1-codex-max joining the Foundry family, the possibilities for intelligent applications and agentic workflows have never been greater. GPT 5.1-codex-max is available today in Microsoft Foundry and accessible in Visual Studio Code via the Foundry extension . Meet GPT-5.1-codex-max: Enterprise-Grade Coding Agent for Complex Projects GPT-5.1-codex-max is engineered for those who build the future. Imagine tackling complex, long-running projects without losing context or momentum. GPT-5.1-codex-max delivers efficiency at scale, cross-platform readiness, and proven performance with top scores on SWE-Bench (77.9), the gold standard for AI coding. With GPT-5.1-codex-max, developers can focus on creativity and problem-solving, while the model handles the heavy lifting. GPT-5.1-codex-max isn’t just powerful; it’s practical, designed to solve real challenges for enterprise developers: Multi-Agent Coding Workflows: Automate repetitive tasks across microservices, maintaining shared context for seamless collaboration. Enterprise App Modernization: Effortlessly refactor legacy .NET and Java applications into cloud-native architectures. Secure API Development: Generate and validate secure API endpoints, with `compliance checks built-in for peace of mind. Continuous Integration Support: Integrate GPT-5.1-codex-max into CI/CD pipelines for automated code reviews and test generation, accelerating delivery cycles. These use cases are just the beginning. GPT-5.1-codex-max is your partner in building robust, scalable, and secure solutions. Foundry: Platform Built for Developers Who Build the Future Foundry is more than a model catalog—it’s an enterprise AI platform designed for developers who need choice, reliability, and speed. • Choice Without Compromise: Access the widest range of models, including frontier models from leading model providers. • Enterprise-Grade Infrastructure: Built-in security, observability, and governance for responsible AI at scale. • Integrated Developer Experience: From GitHub to Visual Studio Code, Foundry connects with tools developers love for a frictionless build-to-deploy journey. Start Building Smarter with GPT-5.1-codex-max in Foundry The future is here, and it’s yours to shape. Supercharge your coding workflows with GPT-5.1-codex-max in Microsoft Foundry today. Learn more about Microsoft Foundry: aka.ms/IgniteFoundryModels. Watch Ignite sessions for deep dives and demos: ignite.microsoft.com. Build faster, smarter, and with confidence on the platform redefining enterprise AI.4KViews3likes5CommentsSecuring Azure AI Applications: A Deep Dive into Emerging Threats | Part 1
Why AI Security Can’t Be Ignored? Generative AI is rapidly reshaping how enterprises operate—accelerating decision-making, enhancing customer experiences, and powering intelligent automation across critical workflows. But as organizations adopt these capabilities at scale, a new challenge emerges: AI introduces security risks that traditional controls cannot fully address. AI models interpret natural language, rely on vast datasets, and behave dynamically. This flexibility enables innovation—but also creates unpredictable attack surfaces that adversaries are actively exploiting. As AI becomes embedded in business-critical operations, securing these systems is no longer optional—it is essential. The New Reality of AI Security The threat landscape surrounding AI is evolving faster than any previous technology wave. Attackers are no longer focused solely on exploiting infrastructure or APIs; they are targeting the intelligence itself—the model, its prompts, and its underlying data. These AI-specific attack vectors can: Expose sensitive or regulated data Trigger unintended or harmful actions Skew decisions made by AI-driven processes Undermine trust in automated systems As AI becomes deeply integrated into customer journeys, operations, and analytics, the impact of these attacks grows exponentially. Why These Threats Matter? Threats such as prompt manipulation and model tampering go beyond technical issues—they strike at the foundational principles of trustworthy AI. They affect: Confidentiality: Preventing accidental or malicious exposure of sensitive data through manipulated prompts. Integrity: Ensuring outputs remain accurate, unbiased, and free from tampering. Reliability: Maintaining consistent model behavior even when adversaries attempt to deceive or mislead the system. When these pillars are compromised, the consequences extend across the business: Incorrect or harmful AI recommendations Regulatory and compliance violations Damage to customer trust Operational and financial risk In regulated sectors, these threats can also impact audit readiness, risk posture, and long-term credibility. Understanding why these risks matter builds the foundation. In the upcoming blogs, we’ll explore how these threats work and practical steps to mitigate them using Azure AI’s security ecosystem. Why AI Security Remains an Evolving Discipline? Traditional security frameworks—built around identity, network boundaries, and application hardening—do not fully address how AI systems operate. Generative models introduce unique and constantly shifting challenges: Dynamic Model Behavior: Models adapt to context and data, creating a fluid and unpredictable attack surface. Natural Language Interfaces: Prompts are unstructured and expressive, making sanitization inherently difficult. Data-Driven Risks: Training and fine-tuning pipelines can be manipulated, poisoned, or misused. Rapidly Emerging Threats: Attack techniques evolve faster than most defensive mechanisms, requiring continuous learning and adaptation. Microsoft and other industry leaders are responding with robust tools—Azure AI Content Safety, Prompt Shields, Responsible AI Frameworks, encryption, isolation patterns—but technology alone cannot eliminate risk. True resilience requires a combination of tooling, governance, awareness, and proactive operational practices. Let's Build a Culture of Vigilance: AI security is not just a technical requirement—it is a strategic business necessity. Effective protection requires collaboration across: Developers Data and AI engineers Cybersecurity teams Cloud platform teams Leadership and governance functions Security for AI is a shared responsibility. Organizations must cultivate awareness, adopt secure design patterns, and continuously monitor for evolving attack techniques. Building this culture of vigilance is critical for long-term success. Key Takeaways: AI brings transformative value, but it also introduces risks that evolve as quickly as the technology itself. Strengthening your AI security posture requires more than robust tooling—it demands responsible AI practices, strong governance, and proactive monitoring. By combining Azure’s built-in security capabilities with disciplined operational practices, organizations can ensure their AI systems remain secure, compliant, and trustworthy, even as new threats emerge. What’s Next? In future blogs, we’ll explore two of the most important AI threats—Prompt Injection and Model Manipulation—and share actionable strategies to mitigate them using Azure AI’s security capabilities. Stay tuned for practical guidance, real-world scenarios, and Microsoft-backed best practices to keep your AI applications secure. Stay Tuned.!660Views3likes0CommentsIntroducing Cohere Rerank 4.0 in Microsoft Foundry
These new retrieval models deliver state-of-the-art accuracy, multilingual coverage across 100+ languages, and breakthrough performance for enterprise search and retrieval-augmented generation (RAG) systems. With Rerank 4.0, customers can dramatically improve the quality of search, reduce hallucinations in RAG applications, and strengthen the reasoning capabilities of their AI agents, all with just a few lines of code. Why Rerank Models Matter for Enterprise AI Retrieval is the foundation of grounded AI systems. Whether you are building an internal assistant, a customer-facing chatbot, or a domain-specific knowledge engine, the quality of the retrieved documents determines the quality of the final answer. Traditional embeddings get you close, but reranking is what gets you the right answer. Rerank improves this step by reading both the query and document together (cross-encoding), producing highly precise semantic relevance scores. This means: More accurate search results More grounded responses in RAG pipelines Lower generative model usage , reducing cost Higher trust and quality across enterprise workloads Introducing Cohere Rerank 4.0 Fast and Rerank 4.0 Pro Microsoft Foundry now offers two versions of Rerank 4.0 to meet different enterprise needs: Rerank 4.0 Fast Best balance of speed and accuracy Same latency as Cohere Rerank 3.5, with significantly higher accuracy Ideal for high-traffic applications and real-time systems Rerank 4.0 Pro Highest accuracy across all benchmarks Excels at complex, reasoning-heavy, domain-specific retrieval Tuned for industries like finance, healthcare, manufacturing, government, and energy Multilingual & Cross-Domain Performance Rerank 4.0 delivers unmatched multilingual and cross-domain performance, supporting more than 100 languages and enabling powerful cross-lingual search across complex enterprise datasets. The models achieve state-of-the-art accuracy in 10 of the world’s most important business languages, including Arabic, Chinese, French, German, Hindi, Japanese, Korean, Portuguese, Russian, and Spanish, making them exceptionally well suited for global organizations with multilingual knowledge bases, compliance archives, or international operations. Effortless Integration: Add Rerank to Any System One of the biggest benefits of Rerank 4.0 is how easy it is to adopt. You can add reranking to: Existing enterprise search Vector DB pipelines Keyword search systems Hybrid retrieval setups RAG architectures Agent workflows No infrastructure changes required. Just a few lines of code.This makes it one of the fastest ways to meaningfully upgrade grounding, precision, and search quality in enterprise AI systems. Better RAG, Better Agents, Better Outcomes In Foundry, customers can pair Cohere Rerank 4.0 with Azure Search, vector databases, Agent Service, Azure Functions, Foundry orchestration, and any LLM—including GPT-4.1, Claude, DeepSeek, and Mistral—to deliver more grounded copilots, higher-fidelity agent actions, and better reasoning from cleaner context windows. This reduces hallucinations, lowers LLM spend, and provides a foundational upgrade for mission-critical AI systems. Built for Enterprise: Security, Observability, Governance As a direct from Azure model, Rerank 4.0 is fully integrated with: Azure role-based access control (RBAC) Virtual network isolation Customer-managed keys Logging & observability Entra ID authentication Private deployments You can run Rerank 4.0 in environments that meet the strictest enterprise security and compliance needs. Optimized for Enterprise Models & High-Value Industries Rerank 4.0 is built for sectors where accuracy matters: Finance - Delivers precise retrieval for complex disclosures, compliance documents, and regulatory filings. Healthcare- Accurately retrieves clinical notes, biomedical literature, and care protocols for safer, more reliable insights. Manufacturing- Surfaces the right engineering specs, manuals, and parts data to streamline operations and reduce downtime. Government & Public Sector - Improves access to policy documents, case archives, and citizen service information with semantic precision. Energy- Understands industrial logs, safety manuals, and technical standards to support safer and more efficient operations. Pricing Model Name Deployment Type Azure Resource Region Price /1K Search Units Availability Cohere Rerank 4.0 Pro Global Standard All regions (Check this page for region details) $2.50 Public Preview, Dec 11, 2025 Cohere Rerank 4.0 Fast Global Standard All regions (Check this page for region details) $2.00 Public Preview, Dec 11, 2025 Get Started Today Cohere Rerank 4.0 Fast and Rerank 4.0 Pro are now available in Microsoft Foundry. Rerank 4.0 is one of the simplest and highest impact upgrades you can make to your enterprise AI stack, bringing better retrieval, better agents, and more trustworthy AI to every application.2.7KViews2likes0CommentsBuilding Secure, Governable AI Agents with Microsoft Foundry
The era of AI agents has arrived — and it’s accelerating fast. Organizations are moving beyond simple chatbots that merely respond to requests and toward intelligent agents that can reason, act, and adapt without human supervision. These agents can analyze data, call tools, orchestrate workflows, and make autonomous decisions in real time. As a result, agents are becoming integral members of teams, augmenting and amplifying human capabilities across organizations at scale. But the very strengths that make agents so powerful — their autonomy, intelligence, and ability to operate like virtual teammates — also introduce new risks. Enterprises need a platform that doesn’t just enable agent development but governs it from the start, ensuring security, accountability, and trust at every layer. That’s where Microsoft Foundry comes in. A unified platform for enterprise-ready agents Microsoft Foundry is a unified, interoperable platform for building, optimizing, and governing AI apps and agents at scale. At its core is Foundry Agent Service, which connects models, tools, knowledge, and frameworks into a single, observable runtime. Microsoft Foundry enables companies to shift-left, with security, safety, and governance integrated from the beginning of a developer's workflow. It delivers enterprise-grade controls from setup through production, giving customers the trust, flexibility, and confidence to innovate. 1. Setup: Start with the right foundation Enterprise customers have stringent networking, compliance, and security requirements that must be met before they can even start testing AI capabilities. Microsoft Foundry Agent Service provides a flexible setup experience designed to meet organizations where they are — whether you’re a startup prioritizing speed and simplicity or an enterprise with strict data and compliance needs. Data Control Basic Setup: Ideal for rapid prototyping and getting started quickly. This mode uses platform-managed storage. Standard Setup: Enables fine-grained control over your data by using your own Azure resources and configurations. Networking Bring Your Own Virtual Network (BYO VNet) or enable a Managed Virtual Network (Managed VNet) to enable full network isolation and strict data exfiltration control, ensuring that sensitive information remains within your organization’s trusted boundaries. Using your own virtual network for agents and evaluation workloads in Foundry allows the networking controls to be in your hands, including setting up your own Firewall to control egress traffic, virtual network peering, and setting NSGs and UDRs for managing network traffic. Managed virtual network (preview) creates a virtual network in the Microsoft tenant to handle the egress traffic of your agents. The managed virtual network handles the hassle of setting up network isolation for your Foundry resource and agents, such as setting up the subnet range, IP selection, and subnet delegation. Secrets Management Choose between a Managed Key Vault or Bring Your Own Key Vault to manage secrets and access credentials in a way that aligns with your organization’s security policies. These credentials are critical for establishing secure connections to external resources and tools integrated via the Model Context Protocol (MCP). Encryption Data is always encrypted in transit and at rest using Microsoft-managed keys by default. For enhanced ownership and control, customers can opt for Customer Managed Keys (CMK) to enable key rotation and fine-tuned data governance. Model Governance with AI Gateway Foundry supports Bring Your Own AI Gateway (preview) so enterprises can integrate their existing Foundry and Azure OpenAI model endpoints into Foundry Agent Service behind an AI Gateway for maximum flexibility, control, and governance. Authentication Foundry enforces keyless authentication using Microsoft Entra ID for all end-users wanting to access agents. 2. Development: Build agents you trust Once the environment is configured, Foundry provides tools to develop, control, and evaluate agents before putting them into production. Microsoft Entra Agent ID Every agent in Foundry is assigned a Microsoft Entra Agent ID — a new identity type purpose-built for the security and operational needs of enterprise-scale AI agents. With an agent identity, agents can be recognized, authenticated, and governed just like users, allowing IT teams to enforce familiar controls such as Conditional Access, Identity Protection, Identity Governance, and network policies. In the Microsoft Entra admin center, you will manage your agent inventory which lists all agents in your tenant including those created in Foundry, Copilot Studio, and any 3P agent you register. Unpublished agents (shared agent identity): All unpublished or in-development Foundry agents within the same project share a common agent identity. This design simplifies permission management because unpublished agents typically require the same access patterns and permission configurations. The shared identity approach provides several benefits: Simplified administration: Administrators can centrally manage permissions for all in-development agents within a project Reduced identity sprawl: Using a single identity per project prevents unnecessary identity creation during early experimentation Developer autonomy: Once the shared identity is configured, developers can independently build and test agents without repeatedly configuring new permissions Published Agents (unique agent identity): When you want to share an agent with others as a stable offering, publish it to an agent application. Once published, the agent gets assigned a unique agent identity, tied to the agent application. This establishes durable, auditable boundaries for production agents and enables independent lifecycle, compliance, and monitoring controls. Observability: Tracing, Evaluation, and Monitoring Microsoft Foundry provides a comprehensive observability layer that gives teams deep visibility into agent performance, quality, and operational health across development and production. Foundry’s observability stack brings together traces, logs, evaluations, and safety signals to help developers and administrators understand exactly how an agent arrived at an answer, which tools it used, and where issues may be emerging. This includes: Tracing: Track every step of an agent response including prompts, tool calls, tool responses, and output generation to understand decision paths, latency contributors, and failure points. Evaluations: Foundry provides a comprehensive library of built-in evaluators that measure coherence, groundedness, relevance, safety risks, security vulnerabilities, and agent-specific behaviors such as task adherence or tool-call accuracy. These evaluations help teams catch regressions early, benchmark model quality, and validate that agents behave as intended before moving to production. Monitoring: The Agent Monitoring Dashboard in Microsoft Foundry provides real-time insights into the operational health, performance, and compliance of your AI agents. This dashboard can track token usage, latency, evaluation metrics, and security posture across multi-agent systems. AI red teaming: Foundry’s AI Red Teaming Agent can be used to probe agents with adversarial queries to detect jailbreaks, prompt attacks, and security vulnerabilities. Agent Guardrails and Controls Microsoft Foundry offers safety and security guardrails that can be applied to core models, including image generation models, and agents. Guardrails consist of controls that define three things: What risk to detect (e.g., harmful content, prompt attacks, data leakage) Where to scan for it (user input, tool calls, tool responses, or model output) What action to take (annotate or block) Foundry automatically applies a default safety guardrail to all models and agents, mitigating a broad range of risks — including hate and fairness issues, sexual or violent content, self-harm, protected text/code material, and prompt-injection attempts. For organizations that require more granular control, Foundry supports custom guardrails. These allow teams to tune detection levels, selectively enable or disable risks, and apply different safety policies at the model or agent level. Tool Controls with AI Gateway To enforce tool-level controls, connect AI Gateway to your Foundry project. Once connected, all MCP and OpenAPI tools automatically receive an AI Gateway endpoint, allowing administrators to control how agents call these tools, where they can be accessed from, and who is authorized to use them. You can configure inbound, backend, outbound, and error-handling policies — for example, restricting which IPs can call an API, setting error-handling rules, or applying rate-limiting policies to control how often a tool can be invoked within a given time window. 3. Publish: Securely share your agents with end users Once the proper controls are in place and testing is complete, the agent is ready to be promoted to production. At this stage, enterprises need a secure, governed way to publish and share agents with teammates or customers. Publishing an agent to an Agent Application Anyone with the Azure AI User role on a Foundry project can interact with all agents inside that project, with conversations and state shared across users. This model is ideal for development scenarios like authoring, debugging, and testing, but it is not suitable for distributing agents to broader audiences. Publishing promotes an agent from a development asset into a managed Azure resource with a dedicated endpoint, independent identity, and governance capabilities. When an agent is published, Foundry creates an Agent Application resource designed for secure, scalable distribution. This resource provides a stable endpoint, a unique agent identity with full audit trails, cross-team sharing capabilities, integration with the Entra Agent Registry, and isolation of user data. Instead of granting access to the entire Foundry project, you grant users access only to the Agent Application resource. Integrate with M365/A365 Once your agent is published, you can integrate it into Microsoft 365 or Agent 365. This enables developers to seamlessly deploy agents from Foundry into Microsoft productivity experiences like M365 Copilot or Teams. Users can access and interact with these agents in the canvases they already use every day, providing enterprise-ready distribution with familiar governance and trust boundaries. 4. Production: Govern your agent fleet at scale As organizations expand from a handful of agents to hundreds or thousands, visibility and control become essential. The Foundry Control Plane delivers a unified, real-time view of a company's entire agent ecosystem — spanning Foundry-built and third-party agents. Key capabilities include: Comprehensive agent inventory: View and govern 100% of your agent fleet with sortable, filterable data views. Foundry Control Plane gives developers a clear understanding of every agent—whether built in Foundry, Microsoft Agent Framework, LangChain, or LangGraph—and surfaces them in one place, regardless of where they’re hosted or which cloud they run on. Operational control: Pause or disable an agent when a risk is detected. Real-time alerts: Get notified about policy, evaluation, and security alerts. Policy compliance management: Enforce organization-wide AI content policies and model policies to only allow developers to build agents with approved models in your enterprise. Cost and ROI insights: Real-time cost charts in Foundry give an accurate view of spending across all agents in a project or subscription, with drill-down capabilities to see costs at the individual agent or run level. Agent behavior controls: Apply consistent guardrails across inputs, outputs, and now tool interactions. Health and quality metrics: Review performance and reliability scores for each agent, with drilldowns for deeper analysis and corrective action. Foundry Control Plane brings everything together into a single, connected experience, enabling teams to observe, control, secure, and operate their entire agent fleet. Its capabilities work together seamlessly to help organizations build and manage AI systems that are both powerful and responsibly governed. Build agents with confidence Microsoft Foundry unifies identity, governance, security, observability, and operational control into a single end-to-end platform purpose-built for enterprise AI. With Foundry, companies can choose the setup model that matches their security and compliance posture, apply agent-level guardrails and tool-level controls with AI Gateway, securely publish and share agents across Microsoft 365 and A365, and govern their entire agent fleet through the Foundry Control Plane. At the center of this system is Microsoft Entra Agent ID, ensuring every agent has a managed identity. With these capabilities, organizations can deploy autonomous agents at scale — knowing every interaction is traceable, every risk is mitigated, and every agent is fully accountable. Whether you're building your first agent or managing a fleet of thousands, Foundry provides the foundation to innovate boldly while meeting the trust, compliance, and operational excellence enterprises require. The future of work is one where people and agents collaborate seamlessly — and Microsoft Foundry gives you the platform to build it with confidence. Learn more To dive deeper, watch the Ignite 2025 session: Entra Agent ID and other enterprise superpowers in Microsoft Foundry. To learn more, visit Microsoft Learn and explore resources including the AI Agents for Beginners, Microsoft Agent Framework, and course materials that help you build and operate agents responsibly.807Views2likes0CommentsHybrid AI Using Foundry Local, Microsoft Foundry and the Agent Framework - Part 1
Hybrid AI is quickly becoming one of the most practical architectures for real-world applications—especially when privacy, compliance, or sensitive data handling matter. Today, it’s increasingly common for users to have capable GPUs in their laptops or desktops, and the ecosystem of small, efficient open-source language models has grown dramatically. That makes local inference not only possible, but easy. In this guide, we explore how a locally run agent built with the Agent Framework can combine the strengths of cloud models in Azure AI Foundry with a local LLM running on your own GPU through Foundry Local. This pattern allows you to use powerful cloud reasoning without ever sending raw sensitive data—like medical labs, legal documents, or financial statements—off the device. Part 1 focuses on the foundations of this architecture, using a simple illustrative example to show how local and cloud inference can work together seamlessly under a single agent. Disclaimer: The diagnostic results, symptom checker, and any medical guidance provided in this article are for illustrative and informational purposes only. They are not intended to provide medical advice, diagnosis, or treatment. Demonstrating the concept Problem Statement We’ve all done it: something feels off, we get a strange symptom, or a lab report pops into our inbox—and before thinking twice, we copy-paste way too much personal information into whatever website or chatbot seems helpful at the moment. Names, dates of birth, addresses, lab values, clinic details… all shared out of habit, usually because we just want answers quickly. This guide uses a simple, illustrative scenario—a symptom checker with lab report summarization—to show how hybrid AI can help reduce that oversharing. It’s not a medical product or a clinical solution, but it’s a great way to understand the pattern. With Microsoft Foundry, Foundry Local, and the Agent Framework, we can build workflows where sensitive data stays on the user’s machine and is processed locally, while the cloud handles the heavier reasoning. Only a safe, structured summary ever leaves the device. The Agent Framework handles when to use the local model vs. the cloud model, giving us a seamless and privacy-preserving hybrid experience. Demo scenario This demo uses a simple, illustrative symptom-checker to show how hybrid AI keeps sensitive data private while still benefiting from powerful cloud reasoning. It’s not a medical product—just an easy way to demonstrate the pattern: Here’s what happens: A Python agent (Agent Framework) runs locally and can call both cloud models and local tools. Azure AI Foundry (GPT-4o) handles reasoning and triage logic but never sees raw PHI. Foundry Local runs a small LLM (phi-4-mini) on your GPU and processes the raw lab report entirely on-device. A tool function (@ai_function) lets the agent call the local model automatically when it detects lab-like text. The flow is simple: user_message = symptoms + raw lab text agent → calls local tool → local LLM returns JSON cloud LLM → uses JSON to produce guidance Environment setup Foundry Local Service On the local machine with GPU, let's install Foundry local using: PS C: \Windows\system32> winget install Microsoft.FoundryLocal Then let's download our local model, in this case phi-4-mini and test it: PS C:\Windows\system32> foundry model download phi-4-mini Downloading Phi-4-mini-instruct-cuda-gpu:5... [################### ] 53.59 % [Time remaining: about 4m] 5.9 MB/s/s PS C:\Windows\system32> foundry model load phi-4-mini 🕗 Loading model... 🟢 Model phi-4-mini loaded successfully PS C:\Windows\system32> foundry model run phi-4-mini Model Phi-4-mini-instruct-cuda-gpu:5 was found in the local cache. Interactive Chat. Enter /? or /help for help. Press Ctrl+C to cancel generation. Type /exit to leave the chat. Interactive mode, please enter your prompt > Hello can you let me know who you are and which model you are using 🧠 Thinking... 🤖 Hello! I'm Phi, an AI developed by Microsoft. I'm here to help you with any questions or tasks you have. How can I assist you today? > PS C:\Windows\system32> foundry service status 🟢 Model management service is running on http://127.0.0.1:52403/openai/status Now we see the model is accessible with API on the localhost with port 52403. Foundry Local models don’t always use simple names like "phi-4-mini". Each installed model has a specific Model ID that Foundry Local assigns (for example: Phi-4-mini-instruct-cuda-gpu:5 in this case). We now can use the Model ID for a quick test: from openai import OpenAI client = OpenAI(base_url="http://127.0.0.1:52403/v1", api_key="ignored") resp = client.chat.completions.create( model="Phi-4-mini-instruct-cuda-gpu:5", messages=[{"role": "user", "content": "Say hello"}]) Returned 200 OK. Microsoft Foundry To handle the cloud part of the hybrid workflow, we start by creating a Microsoft AI Foundry project. This gives us an easy, managed way to use models like GPT-4o-mini —no deployment steps, no servers to configure. You simply point the Agent Framework at your project, authenticate, and you’re ready to call the model. A nice benefit is that Microsoft Foundry and Foundry Local share the same style of API. Whether you call a model in the cloud or on your own machine, the request looks almost identical. This consistency makes hybrid development much easier: the agent doesn’t need different logic for local vs. cloud models—it just switches between them when needed. Under the Hood of Our Hybrid AI Workflow Agent Framework For the agent code, I am using the Agent Framework libraries, and I am giving specific instructions to the agent as per below: from agent_framework import ChatAgent, ai_function from agent_framework.azure import AzureAIAgentClient from azure.identity.aio import AzureCliCredential # ========= Cloud Symptom Checker Instructions ========= SYMPTOM_CHECKER_INSTRUCTIONS = """ You are a careful symptom-checker assistant for non-emergency triage. General behavior: - You are NOT a clinician. Do NOT provide medical diagnosis or prescribe treatment. - First, check for red-flag symptoms (e.g., chest pain, trouble breathing, severe bleeding, stroke signs, one-sided weakness, confusion, fainting). If any are present, advise urgent/emergency care and STOP. - If no red-flags, summarize key factors (age group, duration, severity), then provide: 1) sensible next steps a layperson could take, 2) clear guidance on when to contact a clinician, 3) simple self-care advice if appropriate. - Use plain language, under 8 bullets total. - Always end with: "This is not medical advice." Tool usage: - When the user provides raw lab report text, or mentions “labs below” or “see labs”, you MUST call the `summarize_lab_report` tool to convert the labs into structured data before giving your triage guidance. - Use the tool result as context, but do NOT expose the raw JSON directly. Instead, summarize the key abnormal findings in plain language. """.strip() Referencing the local model Now I am providing a system prompt for the locally inferred model to transform the lab result text into a JSON object with lab results only: # ========= Local Lab Summarizer (Foundry Local + Phi-4-mini) ========= FOUNDRY_LOCAL_BASE = "http://127.0.0.1:52403" # from `foundry service status` FOUNDRY_LOCAL_CHAT_URL = FOUNDRY_LOCAL_BASE + "/v1/chat/completions" # This is the model id you confirmed works: FOUNDRY_LOCAL_MODEL_ID = "Phi-4-mini-instruct-cuda-gpu:5" LOCAL_LAB_SYSTEM_PROMPT = """ You are a medical lab report summarizer running locally on the user's machine. You MUST respond with ONLY one valid JSON object. Do not include any explanation, backticks, markdown, or text outside the JSON. The JSON must have this shape: { "overall_assessment": "<short plain English summary>", "notable_abnormal_results": [ { "test": "string", "value": "string", "unit": "string or null", "reference_range": "string or null", "severity": "mild|moderate|severe" } ] } If you are unsure about a field, use null. Do NOT invent values. """.strip() Agent Framework tool In this next step, we wrap the local Foundry inference inside an Agent Framework tool using the AI_function decorator. This abstraction is more than styler—it is the recommended best practice for hybrid architectures. By exposing local GPU inference as a tool, the cloud-hosted agent can decide when to call it, pass structured arguments, and consume the returned JSON seamlessly. It also ensures that the raw lab text (which may contain PII) stays strictly within the local function boundary, never entering the cloud conversation. Using a tool in this way provides a consistent, declarative interface, enables automatic reasoning and tool-routing by frontier models, and keeps the entire hybrid workflow maintainable, testable, and secure: @ai_function( name="summarize_lab_report", description=( "Summarize a raw lab report into structured abnormalities using a local model " "running on the user's GPU. Use this whenever the user provides lab results as text." ), ) def summarize_lab_report( lab_text: Annotated[str, Field(description="The raw text of the lab report to summarize.")], ) -> Dict[str, Any]: """ Tool: summarize a lab report using Foundry Local (Phi-4-mini) on the user's GPU. Returns a JSON-compatible dict with: - overall_assessment: short text summary - notable_abnormal_results: list of abnormal test objects """ payload = { "model": FOUNDRY_LOCAL_MODEL_ID, "messages": [ {"role": "system", "content": LOCAL_LAB_SYSTEM_PROMPT}, {"role": "user", "content": lab_text}, ], "max_tokens": 256, "temperature": 0.2, } headers = { "Content-Type": "application/json", } print(f"[LOCAL TOOL] POST {FOUNDRY_LOCAL_CHAT_URL}") resp = requests.post( FOUNDRY_LOCAL_CHAT_URL, headers=headers, data=json.dumps(payload), timeout=120, ) resp.raise_for_status() data = resp.json() # OpenAI-compatible shape: choices[0].message.content content = data["choices"][0]["message"]["content"] # Handle string vs list-of-parts if isinstance(content, list): content_text = "".join( part.get("text", "") for part in content if isinstance(part, dict) ) else: content_text = content print("[LOCAL TOOL] Raw content from model:") print(content_text) # Strip ```json fences if present, then parse JSON cleaned = _strip_code_fences(content_text) lab_summary = json.loads(cleaned) print("[LOCAL TOOL] Parsed lab summary JSON:") print(json.dumps(lab_summary, indent=2)) # Return dict – Agent Framework will serialize this as the tool result return lab_summary The case, labs and prompt All patient and provider information in below example is entirely fictitious and used for illustrative purposes only. To illustrate the pattern, this sample prepares the “case” in code: it combines a symptom description with a lab report string and then submits that prompt to the agent. In production, these inputs would be captured from a UI or API. # Example free-text case + raw lab text that the agent can decide to send to the tool case = ( "Teenager with bad headache and throwing up. Fever of 40C and no other symptoms." ) lab_report_text = """ ------------------------------------------- AI Land FAMILY LABORATORY SERVICES 4420 Camino Del Foundry, Suite 210 Gpuville, CA 92108 Phone: (123) 555-4821 | Fax: (123) 555-4822 ------------------------------------------- PATIENT INFORMATION Name: Frontier Model DOB: 04/12/2007 (17 yrs) Sex: Male Patient ID: AXT-442871 Address: 1921 MCP Court, CA 01100 ORDERING PROVIDER Dr. Bot, MD NPI: 1780952216 Clinic: Phi Pediatrics Group REPORT DETAILS Accession #: 24-SDFLS-118392 Collected: 11/14/2025 14:32 Received: 11/14/2025 16:06 Reported: 11/14/2025 20:54 Specimen: Whole Blood (EDTA), Serum Separator Tube ------------------------------------------------------ COMPLETE BLOOD COUNT (CBC) ------------------------------------------------------ WBC ................. 14.5 x10^3/µL (4.0 – 10.0) HIGH RBC ................. 4.61 x10^6/µL (4.50 – 5.90) Hemoglobin .......... 13.2 g/dL (13.0 – 17.5) LOW-NORMAL Hematocrit .......... 39.8 % (40.0 – 52.0) LOW MCV ................. 86.4 fL (80 – 100) Platelets ........... 210 x10^3/µL (150 – 400) ------------------------------------------------------ INFLAMMATORY MARKERS ------------------------------------------------------ C-Reactive Protein (CRP) ......... 60 mg/L (< 5 mg/L) HIGH Erythrocyte Sedimentation Rate ... 32 mm/hr (0 – 15 mm/hr) HIGH ------------------------------------------------------ BASIC METABOLIC PANEL (BMP) ------------------------------------------------------ Sodium (Na) .............. 138 mmol/L (135 – 145) Potassium (K) ............ 3.9 mmol/L (3.5 – 5.1) Chloride (Cl) ............ 102 mmol/L (98 – 107) CO2 (Bicarbonate) ........ 23 mmol/L (22 – 29) Blood Urea Nitrogen (BUN) 11 mg/dL (7 – 20) Creatinine ................ 0.74 mg/dL (0.50 – 1.00) Glucose (fasting) ......... 109 mg/dL (70 – 99) HIGH ------------------------------------------------------ LIVER FUNCTION TESTS ------------------------------------------------------ AST ....................... 28 U/L (0 – 40) ALT ....................... 22 U/L (0 – 44) Alkaline Phosphatase ...... 144 U/L (65 – 260) Total Bilirubin ........... 0.6 mg/dL (0.1 – 1.2) ------------------------------------------------------ NOTES ------------------------------------------------------ Mild leukocytosis and elevated inflammatory markers (CRP, ESR) may indicate an acute infectious or inflammatory process. Glucose slightly elevated; could be non-fasting. ------------------------------------------------------ END OF REPORT SDFLS-CLIA ID: 05D5554973 This report is for informational purposes only and not a diagnosis. ------------------------------------------------------ """ # Single user message that gives both the case and labs. # The agent will see that there are labs and call summarize_lab_report() as a tool. user_message = ( "Patient case:\n" f"{case}\n\n" "Here are the lab results as raw text. If helpful, you can summarize them first:\n" f"{lab_report_text}\n\n" "Please provide non-emergency triage guidance." ) The Hybrid Agent code Here’s where the hybrid behavior actually comes together. By this point, we’ve defined a local tool that talks to Foundry Local and configured access to a cloud model in Azure AI Foundry. In the main() function, the Agent Framework ties these pieces into a single workflow. The agent runs locally, receives a message containing both symptoms and a raw lab report, and decides when to call the local tool. The lab report is summarized on your GPU, and only the structured JSON is passed to the cloud model for reasoning. The snippet below shows how we attach the tool to the agent and trigger both local inference and cloud guidance within one natural-language prompt # ========= Hybrid Main (Agent uses the local tool) ========= async def main(): ... async with ( AzureCliCredential() as credential, ChatAgent( chat_client=AzureAIAgentClient(async_credential=credential), instructions=SYMPTOM_CHECKER_INSTRUCTIONS, # 👇 Tool is now attached to the agent tools=[summarize_lab_report], name="hybrid-symptom-checker", ) as agent, ): result = await agent.run(user_message) print("\n=== Symptom Checker (Hybrid: Local Tool + Cloud Agent) ===\n") print(result.text) if __name__ == "__main__": asyncio.run(main()) Testing the Hybrid Agent Now I am running the agent code from VSCode and can see the local inference happening when lab was submitted. Then results are formatted, PII omitted and the GPT-40 model can process the symptom along the results What's next In this example, the agent runs locally and pulls in both cloud and local inference. In Part 2, we’ll explore the opposite architecture: a cloud-hosted agent that can safely call back into a local LLM through a secure gateway. This opens the door to more advanced hybrid patterns where tools running on edge devices, desktops, or on-prem systems can participate in cloud-driven workflows without exposing sensitive data. References Agent Framework: https://github.com/microsoft/agent-framework Repo for the code available here:1.2KViews2likes0CommentsIntroducing OpenAI’s GPT-image-1.5 in Microsoft Foundry
Developers building with visual AI can often run into the same frustrations: images that drift from the prompt, inconsistent object placement, text that renders unpredictably, and editing workflows that break when iterating on a single asset. That’s why we are excited to announce OpenAI's GPT Image 1.5 is now generally available in Microsoft Foundry. This model can bring sharper image fidelity, stronger prompt alignment, and faster image generation that supports iterative workflows. Starting today, customers can request access to the model and start building in the Foundry platform. Meet GPT Image 1.5 AI driven image generation began with early models like OpenAI's DALL-E, which introduced the ability to transform text prompts into visuals. Since then, image generation models have been evolving to enhance multimodal AI across industries. GPT Image 1.5 represents continuous improvement in enterprise-grade image generation. Building on the success of GPT Image 1 and GPT Image 1 mini, these enhanced models introduce advanced capabilities that cater to both creative and operational needs. The new image models offer: Text-to-image: Stronger instruction following and highly precise editing. Image-to-image: Transform existing images to iteratively refine specific regions Improved visual fidelity: More detailed scenes and realistic rendering. Accelerated creation times: Up to 4x faster generation speed. Enterprise integration: Deploy and scale securely in Microsoft Foundry. GPT Image 1.5 delivers stronger image preservation and editing capabilities, maintaining critical details like facial likeness, lighting, composition, and color tone across iterative changes. You’ll see more consistent preservation of branded logos and key visuals, making it especially powerful for marketing, brand design, and ecommerce workflows—from graphics and logo creation to generating full product catalogs (variants, environments, and angles) from a single source image. Benchmarks Based on an internal Microsoft dataset, GPT Image 1.5 performs higher than other image generation models in prompt alignment and infographics tasks. It focuses on making clear, strong edits – performing best on single-turn modification, delivering the higher visual quality in both single and multi-turn settings. The following results were found across image generation and editing: Text to image Prompt alignment Diagram / Flowchart GPT Image 1.5 91.2% 96.9% GPT Image 1 87.3% 90.0% Qwen Image 83.9% 33.9% Nano Banana Pro 87.9% 95.3% Image editing Evaluation Aspect Modification Preservation Visual Quality Face Preservation Metrics BinaryEval SC (semantic) DINO (Visual) BinaryEval AuraFace Single-turn GPT image 1 99.2% 51.0% 0.14 79.5% 0.30 Qwen image 81.9% 63.9% 0.44 76.0% 0.85 GPT Image 1.5 100% 56.77% 0.14 89.96% 0.39 Multi-turn GPT Image 1 93.5% 54.7% 0.10 82.8% 0.24 Qwen image 77.3% 68.2% 0.43 77.6% 0.63 GPT image 1.5 92.49% 60.55% 0.15 89.46% 0.28 Using GPT Image 1.5 across industries Whether you’re creating immersive visuals for campaigns, accelerating UI and product design, or producing assets for interactive learning GPT Image 1.5 gives modern enterprises the flexibility and scalability they need. Image models can allow teams to drive deeper engagement through compelling visuals, speed up design cycles for apps, websites, and marketing initiatives, and support inclusivity by generating accessible, high‑quality content for diverse audiences. Watch how Foundry enables developers to iterate with multimodal AI across Black Forest Labs, OpenAI, and more: Microsoft Foundry empowers organizations to deploy these capabilities at scale, integrating image generation seamlessly into enterprise workflows. Explore the use of AI image generation here across industries like: Retail: Generate product imagery for catalogs, e-commerce listings, and personalized shopping experiences. Marketing: Create campaign visuals and social media graphics. Education: Develop interactive learning materials or visual aids. Entertainment: Edit storyboards, character designs, and dynamic scenes for films and games. UI/UX: Accelerate design workflows for apps and websites. Microsoft Foundry provides security and compliance with built-in content safety filters, role-based access, network isolation, and Azure Monitor logging. Integrated governance via Azure Policy, Purview, and Sentinel gives teams real-time visibility and control, so privacy and safety are embedded in every deployment. Learn more about responsible AI at Microsoft. Pricing Model Pricing (per 1M tokens) - Global GPT-image-1.5 Input Tokens: $8 Cached Input Tokens: $2 Output Tokens: $32 Cost efficiency improves as well: image inputs and outputs are now cheaper compared to GPT Image 1, enabling organizations to generate and iterate on more creative assets within the same budget. For detailed pricing, refer here. Getting started Learn more about image generation, explore code samples, and read about responsible AI protections here. Try GPT Image 1.5 in Microsoft Foundry and start building multimodal experiences today. Whether you’re designing educational materials, crafting visual narratives, or accelerating UI workflows, these models deliver the flexibility and performance your organization needs.4.8KViews1like1CommentRun local AI on any PC or Mac — Microsoft Foundry Local
Leverage full hardware performance, keep data private, reduce latency, and predict costs, even in offline or low-connectivity scenarios. Simplify development and deploy AI apps across diverse hardware and OS platforms with the Foundry Local SDK. Manage models locally, switch AI engines easily, and deliver consistent, multi-modal experiences, voice or text, without complex cross-platform setup. Raji Rajagopalan, Microsoft CoreAI Vice President, shares how to start quickly, test locally, and scale confidently. No cloud needed. Build AI apps once and run them locally on Windows, macOS, & mobile. Get started with Foundry Local SDK. Lower latency, data privacy, and cost predictability. All in the box with Foundry Local. Start here. Build once, deploy everywhere. Foundry Local ensures your AI app works on Intel, AMD, Qualcomm, and NVIDIA devices. See how it works. QUICK LINKS: 00:00 — Run AI locally 01:48 — Local AI use cases 02:23 — App portability 03:18 — Run apps on any device 05:14 — Run on older devices 05:58 — Run apps on MacOS 06:18 — Local AI is Multi-modal 07:25 — How it works 08:20 — How to get it running on your device 09:26 — Start with AI Toolkit in VS Code with new SDK 10:11 — Wrap up Link References Check out https://aka.ms/foundrylocalSDK Build an app using code in our repo at https://aka.ms/foundrylocalsamples Unfamiliar with Microsoft Mechanics? As Microsoft’s official video series for IT, you can watch and share valuable content and demos of current and upcoming tech from the people who build it at Microsoft. Subscribe to our YouTube: https://www.youtube.com/c/MicrosoftMechanicsSeries Talk with other IT Pros, join us on the Microsoft Tech Community: https://techcommunity.microsoft.com/t5/microsoft-mechanics-blog/bg-p/MicrosoftMechanicsBlog Watch or listen from anywhere, subscribe to our podcast: https://microsoftmechanics.libsyn.com/podcast Keep getting this insider knowledge, join us on social: Follow us on Twitter: https://twitter.com/MSFTMechanics Share knowledge on LinkedIn: https://www.linkedin.com/company/microsoft-mechanics/ Enjoy us on Instagram: https://www.instagram.com/msftmechanics/ Loosen up with us on TikTok: https://www.tiktok.com/@msftmechanics Video Transcript: - If you want to build apps with powerful AI optimized to run locally across different PC configurations, in addition to macOS and mobile platforms, while taking advantage of bare metal performance, where your same app can run without modification or relying on the cloud, Foundry Local with the new SDK is the way to go. Today, we’ll dig deeper into how it works and how you can use it as a developer. I’m joined today by Raji Rajagopalan, who leads the Foundry Local team at Microsoft. Welcome. - I’m very excited to be here, Jeremy. Thanks for having me. - And thanks so much for joining us today, especially given how fast things are moving quickly in this space. You know, the idea of running AI locally has really shifted from exploration, like we saw over a year ago, to real production proper use cases right now. - Yeah, things are definitely moving fast. We are at a point for local AI now where several things are converging. First, of course, hardware has gotten more powerful with NPUs and GPUs available. Second, we now have smarter and more efficient AI models which need less power and memory to run well. Also, better quantization and distillation mean that even big models can fit and work well directly on your device. This chart, for example, compares the GPT-3.5 Frontier Model, which was one of the leading models around two years ago. And if I compare the accuracy of its output with a smaller quantized model like gpt-oss, you’ll see that bigger isn’t always better. The gpt-oss model exceeds the larger GPT-3.5 LLM on accuracy. And third, as I’ll show you, using the new Foundry Local SDK, the developer experience for building local AI is now a lot simpler. It removes a ton of complexity for getting your apps right into production. And because the AI is local, you don’t even need an Azure subscription. - Okay, so what scenarios do you see this unlocking? - Well, there’s a lot of scenarios that local AI can be quite powerful, actually. For example, if you are offline on a plane or are working in a disconnected or poor connectivity location, latency is an issue. These models will still run. There’s no reliance on the internet. Next, if you have specific privacy requirements for your data, data used for AI reasoning can be stored locally or within your corporate network versus the cloud. And because inference using Foundry Local is free, the costs are more predictable. - So lower latency data privacy, cost predictability. Now, you also mentioned a simpler developer experience with a new Foundry Local SDK. So how does Foundry Local change things? - Well, the biggest issue that we are addressing is app portability. For example, as a developer today, if you wanted to build an AI app that runs locally on most device hardware and across different OS platforms, you’d have to write the device selection logic yourselves and debug cross-platform issues. Once you’re done that, you would need to package it for the different execution providers by hardware type and different device platforms just so that your app could run on those platforms and across different device configurations. It’s an error-prone process. Foundry Local, on the other hand, makes it simple. We have worked extensively with our silicon partners like NVIDIA, Intel, Qualcomm, and AMD to make sure that Foundry Local models just work right on the hardware that you have. - Which is great, because as a developer, you can just focus on building your app. The same app is going to target and work on any consuming device then, right? - That’s right. In fact, I’ll show you. I have built this healthcare concierge app that’s an offline assistant for addressing healthcare questions using information private to me, which is useful when I’m traveling. It’s using a number of models, including the quantized 1.5 billion parameter Qwen model, and it has options to choose other models. This includes the Whisper model for spoken input using speech-to-text conversion, and it can pull from multiple private local data sources using semantic search to retrieve the information it needs to generate responses. I’m going to run the app on different devices with diverse hardware. I’ll start with Windows, and after that I’ll show you how it works on other operating systems. Our first device has a super common configuration. It’s a Windows laptop running Intel Core previous generation with in integrated GPU and no NPU. I have another device, which is an AMD previous-generation PC, also without an NPU. Next, I have a Qualcomm Snapdragon X Plus PC with an NPU. And my fourth device is an Intel PC with an NVIDIA RTX GPU. I’m going to use the same prompt on each of these devices using text first. I’ll prompt: If I have 15 minutes, what exercises can I do from anywhere to stay healthy? And as I run each of these, you’ll see that the model is being influenced across different chipsets. This is using the same app package to support all of these configurations. The model generates its response using its real world training and reasoning over documents related to my medical history. By the way, I’m just using synthetic data for this demo. It’s not my actual medical history. But the most important thing is that this is all happening locally. My private data stays private. Nothing is traversing to or from the internet. - Right, and I can see this being really great for any app scenario that requires more stringent data compliance. You know, based on the configs that you ran across those four different machines that you remoted into, they were relatively new, though. Would it work on older hardware as well? - Yeah, it will. The beauty of Foundry Local is that it makes AI accessible on almost any device. In fact, this time I’m remoted into an eighth-gen Intel PC. It has integrated graphics and eight gigs of RAM, as you can see here in the task manager. I’ll minimize this window and move over to the same app we just saw. I’ll run the same prompt, and you’ll see that it still runs even though this PC was built and purchased in 2019. - And as we saw, that went a little bit slower than some of the other devices, but that’s not really the point here. It means that you as a developer, you can use the same package and it’ll work across multiple generations and types of silicon. - Right, and you can run the same app on macOS as well. Right here, on my Mac, I’ll run the same code. We have here a Foundry Local packaged for macOS. I’ll run the same prompt as before, and you’ll see that just like it ran on my Windows devices, it runs on my Mac as well. The app experience is consistent everywhere. And the cool thing is that local AI is also multimodal. Because this app supports voice input, this time I’ll speak out my prompt. First, to show how easy it is to change the underlying AI model. I’ll swap it to Phi-4-mini-reasoning. Like before, it is set up to use locally stored information for grounding, and the model’s real-world understanding to respond. This time I’ll prompt it with: I’m about to go on a nine-hour flight and will be in London. Given my blood results, what food should I avoid, and how can I improve my health while traveling? And you’ll see that it’s converted my spoken words to text. This prompt requires a bit more reasoning to formulate a response. With the think steps, we can watch how it breaks down, what it needs to do, it’s reasoning over the test results, and how the flight might affect things. And voila, we have the answer. This is the type of response that you might have expected running on larger models and compute in the cloud, but it’s all running locally with sophistication and reasoning. And by the way, if you want to build an app like this, we have published the code in our repo at aka.ms/foundrylocalsamples. - Okay, so what is Foundry Local doing then to make all of this possible? - There’s lots going on under the covers, actually. So let’s unpack. First, Foundry Local lets you discover the latest quantized AI models directly from the Foundry service and bring them to your local device. Once cached, these models can run locally for your apps with zero internet connectivity. Second, when you run your apps, Foundry Local provides a unified runtime built on ONNX for portability. It handles the translation and optimization of your app for performance, tailored to the hardware configuration it’s running on, and it’ll select the right execution provider, whether it’s OpenVINO for Intel, the AMD EP, NVIDIA CUDA, or Qualcommm’s QNN with NPU acceleration and more. So there’s no need to juggle multiple SDKs or frameworks. And third, as your apps interact with cached local models, Foundry Local manages model inference. - Okay, so what would I or anyone watching need to do to get this running on their device? - It’s pretty easy. I’ll show you the manual steps for PC or Mac for anyone to get the basics running. And as a developer, this can all be done programmatically with your application’s installer. Here I have the terminal open. To install Foundry Local using PowerShell, I’ll run winget install Microsoft.FoundryLocal. Of course, on a Mac, you would use brew commands. And once that’s done, you can test it out quickly by getting a model and running something like Foundry model run qwen 2.5–0.5b, or whichever model you prefer. And this process dynamically checks if the model is already local, and if not, it’ll download the right model variant automatically and load it into memory. The time it’ll take to locally cache the model will depend on your network configuration. Once it’s ready, I can stay in the terminal and run a prompt. So I’ll ask: Give me three tips to help me manage anxiety for a quick test. And you’ll see that the local model is responding to my prompt, and it’s running 100% local on this PC. - Okay, so now you have all the baseline components installed on your device. Now, how do you go about building an app like we saw before? - The best way to start is in AI Toolkit in VS Code. And with our new SDK, this lets you run Foundry Local models, manage the local cache, and visualize results within VS Code. So let me show you here. I have my project open in Visual Studio Code with the AI Toolkit installed. This is using OpenAI SDK, as you can see here. It is a C# app using Foundry Local to load and interact with local models on the user device. In this case, we are using a Qwen model by default for our chat completion. And it uses OpenAI Whisper Tiny for speech to text to make voice prompting work. So that’s the code. From there you can package it for Windows and Mac, and you can package it for Android too. - It’s really great to see Foundry Local in action. And I can really see it helping out with lighting up different local AI across the different devices and scenarios. So for all the developers who are watching right now, what’s the best way to get started? - I would say try it out. You don’t need specialized hardware or a dev kit to get started. First, to just get a flavor for Foundry Local on Windows, use the steps I showed with winget, and on macOS, use Brew. Then, and this is where you unlock the most, integrated into your local apps using the SDK. And you can check out aka.ms/foundrylocalSDK. - Thanks, Raji, It’s really great to see how far things have come in this space. And thank you for joining us today. Be sure to subscribe to Mechanics if you haven’t already. We’ll see you again soon.1.5KViews1like0CommentsHybrid AI Using Foundry Local, Microsoft Foundry and the Agent Framework - Part 2
Background In Part 1, we explored how a local LLM running entirely on your GPU can call out to the cloud for advanced capabilities The theme was: Keep your data local. Pull intelligence in only when necessary. That was local-first AI calling cloud agents as needed. This time, the cloud is in charge, and the user interacts with a Microsoft Foundry hosted agent — but whenever it needs private, sensitive, or user-specific information, it securely “calls back home” to a local agent running next to the user via MCP. Think of it as: The cloud agent = specialist doctor The local agent = your health coach who you trust and who knows your medical history The cloud never sees your raw medical history The local agent only shares the minimum amount of information needed to support the cloud agent reasoning This hybrid intelligence pattern respects privacy while still benefiting from hosted frontier-level reasoning. Disclaimer: The diagnostic results, symptom checker, and any medical guidance provided in this article are for illustrative and informational purposes only. They are not intended to provide medical advice, diagnosis, or treatment. Architecture Overview The diagram illustrates a hybrid AI workflow where a Microsoft Foundry–hosted agent in Azure works together with a local MCP server running on the user’s machine. The cloud agent receives user symptoms and uses a frontier model (GPT-4.1) for reasoning, but when it needs personal context—like medical history—it securely calls back into the local MCP Health Coach via a dev-tunnel. The local MCP server queries a local GPU-accelerated LLM (Phi-4-mini via Foundry Local) along with stored health-history JSON, returning only the minimal structured background the cloud model needs. The cloud agent then combines both pieces—its own reasoning plus the local context—to produce tailored recommendations, all while sensitive data stays fully on the user’s device. Hosting the agent in Microsoft Foundry on Azure provides a reliable, scalable orchestration layer that integrates directly with Azure identity, monitoring, and governance. It lets you keep your logic, policies, and reasoning engine in the cloud, while still delegating private or resource-intensive tasks to your local environment. This gives you the best of both worlds: enterprise-grade control and flexibility with edge-level privacy and efficiency. Demo Setup Create the Cloud Hosted Agent Using Microsoft Foundry, I created an agent in the UI and pick gpt-4.1 as model: I provided rigorous instructions as system prompt: You are a medical-specialist reasoning assistant for non-emergency triage. You do NOT have access to the patient’s identity or private medical history. A privacy firewall limits what you can see. A local “Personal Health Coach” LLM exists on the user’s device. It holds the patient’s full medical history privately. You may request information from this local model ONLY by calling the MCP tool: get_patient_background(symptoms) This tool returns a privacy-safe, PII-free medical summary, including: - chronic conditions - allergies - medications - relevant risk factors - relevant recent labs - family history relevance - age group Rules: 1. When symptoms are provided, ALWAYS call get_patient_background BEFORE reasoning. 2. NEVER guess or invent medical history — always retrieve it from the local tool. 3. NEVER ask the user for sensitive personal details. The local model handles that. 4. After the tool runs, combine: (a) the patient_background output (b) the user’s reported symptoms to deliver high-level triage guidance. 5. Do not diagnose or prescribe medication. 6. Always end with: “This is not medical advice.” You MUST display the section “Local Health Coach Summary:” containing the JSON returned from the tool before giving your reasoning. Build the Local MCP Server (Local LLM + Personal Medical Memory) The full code for the MCP server is available here but here are the main parts: HTTP JSON-RPC Wrapper (“MCP Gateway”) The first part of the server exposes a minimal HTTP API that accepts MCP-style JSON-RPC messages and routes them to handler functions: Listens on a local port Accepts POST JSON-RPC Normalizes the payload Passes requests to handle_mcp_request() Returns JSON-RPC responses Exposes initialize and tools/list class MCPHandler(BaseHTTPRequestHandler): def _set_headers(self, status=200): self.send_response(status) self.send_header("Content-Type", "application/json") self.end_headers() def do_GET(self): self._set_headers() self.wfile.write(b"OK") def do_POST(self): content_len = int(self.headers.get("Content-Length", 0)) raw = self.rfile.read(content_len) print("---- RAW BODY ----") print(raw) print("-------------------") try: req = json.loads(raw.decode("utf-8")) except: self._set_headers(400) self.wfile.write(b'{"error":"Invalid JSON"}') return resp = handle_mcp_request(req) self._set_headers() self.wfile.write(json.dumps(resp).encode("utf-8")) Tool Definition: get_patient_background This section defines the tool contract exposed to Azure AI Foundry. The hosted agent sees this tool exactly as if it were a cloud function: Advertises the tool via tools/list Accepts arguments passed from the cloud agent Delegates local reasoning to the GPU LLM Returns structured JSON back to the cloud def handle_mcp_request(req): method = req.get("method") req_id = req.get("id") if method == "tools/list": return { "jsonrpc": "2.0", "id": req_id, "result": { "tools": [ { "name": "get_patient_background", "description": "Returns anonymized personal medical context using your local LLM.", "inputSchema": { "type": "object", "properties": { "symptoms": {"type": "string"} }, "required": ["symptoms"] } } ] } } if method == "tools/call": tool = req["params"]["name"] args = req["params"]["arguments"] if tool == "get_patient_background": symptoms = args.get("symptoms", "") summary = summarize_patient_locally(symptoms) return { "jsonrpc": "2.0", "id": req_id, "result": { "content": [ { "type": "text", "text": json.dumps(summary) } ] } } Local GPU LLM Caller (Foundry Local Integration) This is where personalization happens — entirely on your machine, not in the cloud: Calls the local GPU LLM through Foundry Local Injects private medical data (loaded from a file or memory) Produces anonymized structured outputs Logs debug info so you can see when local inference is running FOUNDRY_LOCAL_BASE_URL = "http://127.0.0.1:52403" FOUNDRY_LOCAL_CHAT_URL = f"{FOUNDRY_LOCAL_BASE_URL}/v1/chat/completions" FOUNDRY_LOCAL_MODEL_ID = "Phi-4-mini-instruct-cuda-gpu:5" def summarize_patient_locally(symptoms: str): print("[LOCAL] Calling Foundry Local GPU model...") payload = { "model": FOUNDRY_LOCAL_MODEL_ID, "messages": [ {"role": "system", "content": PERSONAL_SYSTEM_PROMPT}, {"role": "user", "content": symptoms} ], "max_tokens": 300, "temperature": 0.1 } resp = requests.post( FOUNDRY_LOCAL_CHAT_URL, headers={"Content-Type": "application/json"}, data=json.dumps(payload), timeout=60 ) llm_content = resp.json()["choices"][0]["message"]["content"] print("[LOCAL] Raw content:\n", llm_content) cleaned = _strip_code_fences(llm_content) return json.loads(cleaned) Start a Dev Tunnel Now we need to do some plumbing work to make sure the cloud can resolve the MCP endpoint. I used Azure Dev Tunnels for this demo. The snippet below shows how to set that up in 4 PowerShell commands: PS C:\Windows\system32> winget install Microsoft.DevTunnel PS C:\Windows\system32> devtunnel create mcp-health PS C:\Windows\system32> devtunnel port create mcp-health -p 8081 --protocol http PS C:\Windows\system32> devtunnel host mcp-health I have now a public endpoint: https://xxxxxxxxx.devtunnels.ms:8081 Wire Everything Together in Azure AI Foundry Now let's us the UI to add a new custom tool as MCP for our agent: And point to the public endpoint created previously: Voila, we're done with the setup, let's test it Demo: The Cloud Agent Talks to Your Local Private LLM I am going to use a simple prompt in the agent: “Hi, I’ve been feeling feverish, fatigued, and a bit short of breath since yesterday. Should I be worried?” Disclaimer: The diagnostic results and any medical guidance provided in this article are for illustrative and informational purposes only. They are not intended to provide medical advice, diagnosis, or treatment. Below is the sequence shown in real time: Conclusion — Why This Hybrid Pattern Matters Hybrid AI lets you place intelligence exactly where it belongs: high-value reasoning in the cloud, sensitive or contextual data on the local machine. This protects privacy while reducing cloud compute costs—routine lookups, context gathering, and personal history retrieval can all run on lightweight local models instead of expensive frontier models. This pattern also unlocks powerful real-world applications: local financial data paired with cloud financial analysis, on-device coding knowledge combined with cloud-scale refactoring, or local corporate context augmenting cloud automation agents. In industrial and edge environments, local agents can sit directly next to the action—embedded in factory sensors, cameras, kiosks, or ambient IoT devices—providing instant, private intelligence while the cloud handles complex decision-making. Hybrid AI turns every device into an intelligent collaborator, and every cloud agent into a specialist that can safely leverage local expertise. References Get started with Foundry Local - Foundry Local: https://learn.microsoft.com/en-us/azure/ai-foundry/foundry-local/get-started?view=foundry-classic Using MCP tools with Agents (Microsoft Agent Framework) — https://learn.microsoft.com/en-us/agent-framework/user-guide/model-context-protocol/using-mcp-tools Microsoft Learn Build Agents using Model Context Protocol on Azure — https://learn.microsoft.com/en-us/azure/developer/ai/intro-agents-mcp Microsoft Learn Full demo repo available here.610Views1like0Comments