Microsoft Foundry
Build and Deploy a Microsoft Foundry Hosted Agent: A Hands-On Workshop
Agents are easy to demo, hard to ship. Most teams can put together a convincing prototype quickly. The harder part starts afterwards: shaping deterministic tools, validating behaviour with tests, building a CI path, packaging for deployment, and proving the experience through a user-facing interface. That is where many promising projects slow down. This workshop helps you close that gap without unnecessary friction. You get a guided path from local run to deployment handoff, then complete the journey with a working chat UI that calls your deployed hosted agent through the project endpoint. What You Will Build This is a hands-on, end-to-end learning experience for building and deploying AI agents with Microsoft Foundry. The lab provides a guided and practical journey through hosted-agent development, including deterministic tool design, prompt-guided workflows, CI validation, deployment preparation, and UI integration. It’s designed to reduce setup friction with a ready-to-run experience. It is a prompt-based development lab using Copilot guidance and MCP-assisted workflow options during deployment. It’s a .NET 10 workshop that includes local development, Copilot-assisted coding, CI, secure deployment to Azure, and a working chat UI. A local hosted agent that responds on the responses contract Deterministic tool improvements in core logic with xUnit coverage A GitHub Actions CI workflow for restore, build, test, and container validation An Azure-ready deployment path using azd, ACR image publishing, and Foundry manifest apply A Blazor chat UI that calls openai/v1/responses with agent_reference A repeatable implementation shape that teams can adapt to real projects Who This Lab Is For AI developers and software engineers who prefer learning by building Motivated beginners who want a guided, step-by-step path Experienced developers who want a practical hosted-agent reference implementation Architects evaluating deployment shape, validation strategy, and operational readiness Technical decision-makers who need to see how demos become deployable systems Why Hosted Agents Hosted agents run your code in a managed environment. That matters because it reduces the amount of infrastructure plumbing you need to manage directly, while giving you a clearer path to secure, observable, team-friendly deployments. Prompt-only demos are still useful. They are quick, excellent for ideation, and often the right place to start. Hosted agents complement that approach when you need custom code, tool-backed logic, and a deployment process that can be repeated by a team. Think of this lab as the bridge: you keep the speed of prompt-based iteration, then layer in the real-world patterns needed to run reliably. What You Will Learn 1) Orchestration You will practise workflow-oriented reasoning through implementation-shape recommendations and multi-step readiness scenarios. The lab introduces orchestration concepts at a practical level, rather than as a dedicated orchestration framework deep dive. 2) Tool Integration You will connect deterministic tools and understand how tool calls fit into predictable execution paths. This is a core focus of the workshop and is backed by tests in the solution. 3) Retrieval Patterns (What This Lab Covers Today) This workshop does not include a full RAG implementation with embeddings and vector search. Instead, it focuses on deterministic local tools and hosted-agent response flow, giving you a strong foundation before adding retrieval infrastructure in a follow-on phase. 
4) Observability You will see light observability foundations through OpenTelemetry usage in the host and practical verification during local and deployed checks. This is introductory coverage intended to support debugging and confidence building. 5) Responsible AI You will apply production-minded safety basics, including secure secret handling and review hygiene. A full Responsible AI policy and evaluation framework is not the primary goal of this workshop, but the workflow does encourage safe habits from the start. 6) Secure Deployment Path You will move from local implementation to Azure deployment with a secure, practical workflow: azd provisioning, ACR publishing, manifest deployment, hosted-agent start, status checks, and endpoint validation. The Learning Journey The overall flow is simple and memorable: clone, open, run, iterate, deploy, observe. clone -> open -> run -> iterate -> deploy -> observe You are not expected to memorize every command. The lab is structured to help you learn through small, meaningful wins that build confidence. Your First 15 Minutes: Quick Wins Open the repo and understand the lab structure in a few minutes Set project endpoint and model deployment environment variables Run the host locally and validate the responses endpoint Inspect the deterministic tools in WorkshopLab.Core Run tests and see how behaviour changes are verified Review the deployment path so local work maps to Azure steps Understand how the UI validates end-to-end behaviour after deployment Leave the first session with a working baseline and a clear next step That first checkpoint is important. Once you see a working loop on your own machine, the rest of the workshop becomes much easier to finish. Using Copilot and MCP in the Workflow This lab emphasises prompt-based development patterns that help you move faster while still learning the underlying architecture. You are not only writing code, you are learning to describe intent clearly, inspect generated output, and iterate with discipline. Copilot supports implementation and review in the coding labs. MCP appears as a practical deployment option for hosted-agent lifecycle actions, provided your tools are authenticated to the correct tenant and project context. Together, this creates a development rhythm that is especially useful for learning: Define intent with clear prompts Generate or adjust implementation details Validate behaviour through tests and UI checks Deploy and observe outcomes in Azure Refine based on evidence, not guesswork That same rhythm transfers well to real projects. Even if your production environment differs, the patterns from this workshop are adaptable. Production-Minded Tips As you complete the lab, keep a production mindset from day one: Reliability: keep deterministic logic small, testable, and explicit Security: Treat secrets, identity, and access boundaries as first-class concerns Observability: use telemetry and status checks to speed up debugging Governance: keep deployment steps explicit so teams can review and repeat them You do not need to solve everything in one pass. The goal is to build habits that make your agent projects safer and easier to evolve. Start Today: If you have been waiting for the right time to move from “interesting demo” to “practical implementation”, this is the moment. The workshop is structured for self-study, and the steps are designed to keep your momentum high. Start here: https://github.com/microsoft/Hosted_Agents_Workshop_Lab Want deeper documentation while you go? 
These official guides are great companions:
Hosted agent quickstart
Hosted agent deployment guide

When you finish, share what you built. Post a screenshot or short write-up in a GitHub issue/discussion, on social, or in comments with one lesson learned. Your example can help the next developer get unstuck faster.

Copy/Paste Progress Checklist
[ ] Clone the workshop repo
[ ] Complete local setup and run the agent
[ ] Make one prompt-based behaviour change
[ ] Validate with tests and chat UI
[ ] Run CI checks
[ ] Provision and deploy via Azure and Foundry workflow
[ ] Review observability signals and refine
[ ] Share what I built + one takeaway

Common Questions
How long does it take? Most developers can complete a meaningful pass in a few focused sessions of 60-75 minutes. You can get the first local success quickly, then continue through deployment and refinement at your own pace.
Do I need an Azure subscription? Yes, for the provisioning and deployment steps. You can still begin local development and testing before completing all Azure activities.
Is it beginner-friendly? Yes. The labs are written for beginners, run in sequence, and include expected outcomes for each stage.
Can I adapt it beyond .NET? Yes. The implementation in this workshop is .NET 10, but the architecture and development patterns can be adapted to other stacks.
What if I am evaluating for a team? This lab is a strong team evaluation asset because it demonstrates the end-to-end flow: local dev, integration patterns, CI, secure deployment, and operational visibility.

Closing
This workshop gives you more than theory. It gives you a practical path from first local run to deployed hosted agent, backed by tests, CI, and a user-facing UI validation loop. If you want a build-first route into Microsoft Foundry hosted-agent development, this is an excellent place to start. Begin now: https://github.com/microsoft/Hosted_Agents_Workshop_Lab

Step-by-Step: Deploy the Architecture Review Agent Using AZD AI CLI
Building an AI agent is easy; operating it is an infrastructure trap. Discover how to use the azd ai CLI extension to streamline your workflow. From local testing to deploying a live Microsoft Foundry hosted agent and publishing it to Microsoft Teams—learn how to do it all without writing complex deployment scripts or needing admin permissions.

Microsoft Foundry Model Router: A Developer's Guide to Smarter AI Routing
Introduction
When building AI-powered applications on Azure, one of the most impactful decisions you make isn't about which model to use; it's about how your application selects models at runtime. Microsoft Foundry Model Router, available through Microsoft Foundry, automatically routes your inference requests to the best available model based on prompt complexity, latency targets, and cost efficiency. But how do you know it's actually routing correctly? And how do you compare its behavior across different API paths? That's exactly the problem RouteLens solves. It's an open-source Node.js CLI and web-based testing tool that sends configurable prompts through two distinct Azure AI runtime paths and produces a detailed comparison of routing decisions, latency profiles, and reliability metrics. In this post, we'll walk through what Model Router does, why it matters, how to use the validator tool, and best practices for designing applications that get the most out of intelligent model routing.

What Is Microsoft Foundry Model Router?
Microsoft Foundry Model Router is a deployment option in Microsoft Foundry that sits between your application and a pool of AI models. Instead of hard-coding a specific model like gpt-4o or gpt-4o-mini, you deploy a Model Router endpoint and let Azure decide which underlying model serves each request.

How It Works
1. Your application sends an inference request to the Model Router deployment.
2. Model Router analyzes the request (prompt complexity, token count, required capabilities).
3. It selects the most appropriate model from the available pool.
4. The response is returned transparently — your application code doesn't change.

Why This Matters
Cost optimization — Simple prompts get routed to smaller, cheaper models. Complex prompts go to more capable (and expensive) ones.
Latency reduction — Lightweight prompts complete faster when they don't need a heavyweight model.
Resilience — If one model is experiencing high load or throttling, traffic can shift to alternatives.
Simplified application code — No need to build your own model-selection logic.

The Two Runtime Paths
Microsoft Foundry offers two distinct endpoint configurations for hitting Model Router. Even though both use the Chat Completions API, they may have different routing behaviour:
AOAI + Chat Completions — OpenAI JS SDK — https://<resource>.cognitiveservices.azure.com/openai/deployments/<deployment>
Foundry Project + Chat Completions — OpenAI JS SDK (separate client) — https://<resource>.cognitiveservices.azure.com/openai/deployments/<deployment>
Understanding whether these two paths produce the same routing decisions is critical for production applications. If the same prompt routes to different models depending on which endpoint you use, that's a signal you need to investigate.

Introducing RouteLens
RouteLens is a Node.js tool that automates this comparison. It:
Sends a configurable set of prompts across categories (echo, summarize, code, reasoning) through both paths.
Logs every response to structured JSONL files for post-hoc analysis.
Computes statistics including p50/p95 latency, error rates, and model-choice distribution.
Highlights routing differences — where the same prompt was served by different models across paths.
Provides a web dashboard for interactive testing and real-time result visualization.
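To make the two-path comparison concrete, here is a minimal sketch of the kind of check RouteLens automates. This is not taken from the RouteLens source: the request uses the plain Chat Completions REST contract with an api-key header, and the environment variable names simply mirror the .env settings shown below, so treat the details as assumptions to adapt.

```typescript
// compare-paths.ts — minimal sketch, not from the RouteLens source.
// Assumes two Model Router deployment URLs and an API key are supplied via env vars.
const API_VERSION = "2024-05-01-preview";

interface PathResult {
  path: string;
  model: string;      // the concrete model the router actually selected
  latencyMs: number;
}

async function callModelRouter(path: string, baseUrl: string, prompt: string): Promise<PathResult> {
  const started = Date.now();
  const res = await fetch(`${baseUrl}/chat/completions?api-version=${API_VERSION}`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "api-key": process.env.AOAI_API_KEY ?? "",
    },
    body: JSON.stringify({ messages: [{ role: "user", content: prompt }] }),
  });
  if (!res.ok) throw new Error(`${path} failed: ${res.status} ${await res.text()}`);
  const data = await res.json();
  // The response's "model" field reports which backend model served this request.
  return { path, model: data.model, latencyMs: Date.now() - started };
}

async function comparePaths(prompt: string) {
  const results = await Promise.all([
    callModelRouter("AOAI", process.env.AOAI_BASE_URL ?? "", prompt),
    callModelRouter("Foundry project", process.env.FOUNDRY_PROJECT_ENDPOINT ?? "", prompt),
  ]);
  const [a, b] = results;
  console.log(results);
  console.log(a.model === b.model ? "Routing matches across paths" : "Routing differs — investigate");
}

comparePaths("Summarize the key benefits of intelligent model routing in two sentences.");
```

RouteLens runs this idea at scale, across prompt categories and repeated runs, and persists every result, but the core check is the same: send an identical prompt down both paths, then compare the returned model field and the latency.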
The Web Dashboard
The built-in web UI makes it easy to run tests and explore results without parsing log files. The dashboard includes:
KPI Dashboard — Key metrics at a glance: Success Rate, Avg TPS, Gen TPS, Peak TPS, Fastest Response, p50/p95 Latency, Most Reliable Path, Total Tokens
Summary view — Per-path/per-category stats with success rate, TPS, and latency
Model Comparison — Side-by-side view of which models were selected by each routing path for every prompt
Latency Charts — Visual bar charts comparing p50 and p95 latencies
Error Analysis — Error distribution and detailed error messages
Live Feed — Real-time streaming of results as they come in
Log Viewer — Browse and inspect historical JSONL log files with parsed table views
Mobile Responsive — The UI adapts to smaller screens

Getting Started
Prerequisites
Node.js 18+ (LTS recommended)
An Azure subscription with a Foundry project
Model Router deployed in your Foundry project
An API key from your Azure OpenAI / Foundry resource
The API version (e.g. 2024-05-01-preview)

Setup
# Clone and install
git clone https://github.com/leestott/modelrouter-routelens/
cd modelrouter-routelens
npm install
# Configure your endpoints
cp .env.example .env
# Edit .env with your Azure endpoints (see below)

Configuration
The .env file needs these key settings:
# Your Foundry / Cognitive Services deployment endpoint
# Format: https://<resource>.cognitiveservices.azure.com/openai/deployments/<deployment>
# Do NOT include /chat/completions or ?api-version
FOUNDRY_PROJECT_ENDPOINT=https://<resource>.cognitiveservices.azure.com/openai/deployments/model-router
AOAI_BASE_URL=https://<resource>.cognitiveservices.azure.com/openai/deployments/model-router
# API key from your Azure OpenAI / Foundry resource
AOAI_API_KEY=your-api-key-here
# Azure OpenAI API version
AOAI_API_VERSION=2024-05-01-preview

Running Tests
# Full test matrix — sends all prompts through both paths
npm run run:matrix
# 408 timeout diagnostic — focuses on the Responses path timeout issue
npm run run:repro408
# Web UI — interactive dashboard
npm run ui
# Then open http://localhost:3002 (or the port set in UI_PORT)

Understanding the Results
Latency Comparison
The latency charts show p50 (median) and p95 (tail) latency for each path and prompt category. Key things to look for:
Large p50 differences between paths suggest one path has consistently higher overhead.
High p95 values indicate tail latency problems — possibly timeouts or retries.
Category-specific patterns — If code prompts are slow on one path but fast on another, that's a routing difference worth investigating.

Model Comparison
The model comparison view shows which models were selected for each prompt. When both paths select the same model, you see a green "Match" indicator. When they differ, it's flagged in red — these are the cases you want to investigate.

Error Analysis
The errors view helps diagnose reliability issues. Common error patterns:
408 Timeout — The Responses path may take longer for certain prompt categories
401 Unauthorized — Authentication configuration issues
429 Rate Limited — You're hitting throughput limits
500 Internal Server Error — Backend model issues

Best Practices for Designing Applications with Model Router
1.
Design Prompts with Routing in Mind Model Router makes decisions based on prompt characteristics. To get the best routing: Keep prompts focused — A clear, single-purpose prompt is easier for the router to classify than a multi-part prompt that spans multiple complexity levels. Use system messages effectively — A well-structured system message helps the router understand the task complexity. Separate complex chains — If you have a multi-step workflow, make each step a separate API call rather than one massive prompt. This lets the router use a cheaper model for simple steps. 2. Set Appropriate Timeouts Different models have different latency profiles. Your timeout settings should account for the slowest model the router might select: // Too aggressive — may timeout when routed to a larger model const TIMEOUT = 5000; // 5s // Better — allows headroom for model variation const TIMEOUT = 30000; // 30s // Best — use different timeouts based on expected complexity function getTimeout(category) { switch (category) { case 'echo': return 10000; case 'summarize': return 20000; case 'code': return 45000; case 'reasoning': return 60000; default: return 30000; } } 3. Implement Robust Retry Logic Because the router may select different models on retry, transient failures can resolve themselves: async function callWithRetry(prompt, maxRetries = 3) { for (let attempt = 0; attempt < maxRetries; attempt++) { try { return await client.chat.completions.create({ model: 'model-router', messages: [{ role: 'user', content: prompt }], }); } catch (err) { if (attempt === maxRetries - 1) throw err; // Exponential backoff await new Promise(r => setTimeout(r, 1000 * Math.pow(2, attempt))); } } } 4. Monitor Model Selection in Production Log which model was selected for each request so you can track routing patterns over time: const response = await client.chat.completions.create({ model: 'model-router', messages: [{ role: 'user', content: prompt }], }); // The model field in the response tells you which model was actually used console.log(`Routed to: ${response.model}`); console.log(`Tokens: ${response.usage.total_tokens}`); 5. Use the Right API Path for Your Use Case Based on our testing with RouteLens, consider: Chat Completions path — The standard path for chat-style interactions. Uses the openai SDK directly. Foundry Project path — Uses the same Chat Completions API but through the Foundry project endpoint. Useful for comparing routing behaviour across different endpoint configurations. Note: The Responses API ( /responses ) is not currently available on cognitiveservices.azure.com Model Router deployments. Both paths in RouteLens use Chat Completions. 6. Test Before You Ship Run RouteLens as part of your pre-production validation: # In your CI/CD pipeline or pre-deployment check npm run run:matrix -- --runs 10 --concurrency 4 This helps you: Catch routing regressions when Azure updates model pools Verify that your prompt changes don't cause unexpected model selection shifts Establish latency baselines for alerting Architecture Overview RouteLens sends configurable prompts through two distinct Azure AI runtime paths and compares routing decisions, latency, and reliability. The Matrix Runner dispatches prompts to both the Chat Completions Client (OpenAI JS SDK → AOAI endpoint) and the Project Responses Client ( azure/ai-projects → Foundry endpoint). Both paths converge at Azure Model Router, which intelligently selects the optimal backend model. Results are logged to JSONL files and rendered in the web dashboard. 
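If you prefer to slice the JSONL logs yourself rather than rely on the dashboard, the percentile math is small enough to script. The snippet below is a sketch rather than RouteLens code: the record field names (path, latencyMs, ok) and the log path are assumptions, so adjust them to whatever the tool actually writes.

```typescript
// analyze-logs.ts — sketch for summarizing a RouteLens-style JSONL log (field names assumed).
import { readFileSync } from "node:fs";

interface LogRecord {
  path: string;       // e.g. "aoai" or "foundry-project" (assumed field)
  latencyMs: number;  // end-to-end latency for the request (assumed field)
  ok: boolean;        // whether the call succeeded (assumed field)
}

// Nearest-rank percentile over an already-sorted array of latencies.
function percentile(sorted: number[], p: number): number {
  if (sorted.length === 0) return NaN;
  const idx = Math.min(sorted.length - 1, Math.ceil((p / 100) * sorted.length) - 1);
  return sorted[Math.max(0, idx)];
}

function summarize(file: string) {
  // JSONL: one JSON record per line.
  const records: LogRecord[] = readFileSync(file, "utf8")
    .split("\n")
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line));

  // Group records by runtime path so the two paths can be compared side by side.
  const byPath = new Map<string, LogRecord[]>();
  for (const r of records) {
    const bucket = byPath.get(r.path) ?? [];
    bucket.push(r);
    byPath.set(r.path, bucket);
  }

  for (const [path, rows] of byPath) {
    const latencies = rows.filter((r) => r.ok).map((r) => r.latencyMs).sort((a, b) => a - b);
    const errorRate = rows.filter((r) => !r.ok).length / rows.length;
    console.log(
      `${path}: n=${rows.length}, p50=${percentile(latencies, 50)}ms, ` +
      `p95=${percentile(latencies, 95)}ms, errorRate=${(errorRate * 100).toFixed(1)}%`
    );
  }
}

summarize("logs/run.jsonl"); // hypothetical log path
```

Grouping by path keeps the comparison honest: identical prompts and categories, with only the runtime path changing between buckets.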
Key Benefits of Model Router
Cost savings — Automatically routes simple prompts to cheaper models, reducing spend by 30-50% in typical workloads
Lower latency — Simple prompts complete faster on lightweight models
Zero code changes — Same API contract as a standard model deployment; just change the deployment name
Future-proof — As Azure adds new models to the pool, your application benefits automatically
Built-in resilience — Routing adapts to model availability and load conditions

Conclusion
Azure Model Router represents a shift from "pick a model" to "describe your task and let the platform decide." This is a natural evolution for AI applications — just as cloud platforms abstract away server selection, Model Router abstracts away model selection. RouteLens gives you the visibility to trust that abstraction. By systematically comparing routing behavior across API paths and prompt categories, you can deploy Model Router with confidence and catch issues before your users do. The tool is open source under the MIT license. Try it out, file issues, and contribute improvements:
GitHub Repository
Model Router Documentation
Microsoft Foundry

ProvePresent: Ending Proxy Attendance with Azure Serverless & Azure OpenAI
Problem Most schools use a smart‑card‑based attendance system where students tap their cards on a reader. However, this method is unreliable because students can give their cards to friends or simply tap and leave immediately. Teachers cannot accurately assess real student performance—whether high‑performing students are genuinely attending class or whether poor performance is due to actual absence. Another issue is that even if students are physically present in a lecture, teachers still cannot tell whether they are paying attention to the projector or actually learning. The current workaround is for teachers to override the attendance record by calling each student one by one, which is time‑consuming in large lectures and adds little educational value. It is also only a one‑time check, meaning students can still leave the lecture room immediately afterwards. Another issue is that we have many out‑of‑school activities such as site visit, and the school needs to ensure everyone’s presence promptly in each check point. This kind of problem isn’t unique to schools. It’s a common challenge for event organizers, where verifying attendee presence is essential but often slow, causing long queues. Organizers usually rely on a few mobile scanners to check in attendees one by one. Solution ProvePresent is an AI tool designed to verify attendance and create real‑time challenges for participants, ensuring that attendance records are authentic and that attendees remain focused on the presentation. It uses OTP login with school email. Check-in and Check-out With a Real‑time QR Code The code refreshes every 25 seconds, and the presenter can display it on the projector for everyone to scan when checking in at the beginning and checking out at the end of the session. However, this alone cannot prevent someone from capturing the code and sending it to others who are not in the room, or from using two devices to help someone else scan for attendance—even if geolocation checks are enabled. We will explain this next. This check‑in and check‑out process is highly scalable, and no one needs to queue while waiting for someone to scan their QR code! Organizers can set geolocation restrictions to prevent anyone from checking in remotely in a simple manner. Keep Attendee Alive with Signalr The SignalR live connection allows the presenter to create real‑time challenges for attendees, helping to verify their presence and ensure they are genuinely focused on the presentation. AI Powered Live Quiz The presenter shares their presentation screen, and two Microsoft Foundry agents with Azure OpenAI Chatgpt 5.3 —ImageAnalysisAgent, which extracts key information from the shared screen, and QuizQuestionGenerator, which generates simple questions based on the current slide—work together to create challenges. The question is broadcast to all online attendees, who must answer within 20 seconds. This feature keeps attendees on the webpage and prevents them from doing anything unrelated to the presentation. Detailed report can be downloaded for further analysis. Attendee Photo Capture Request all online students to capture and upload photos of their venue view. The system will analyze the images to estimate seating positions using Microsoft Foundry agents with Azure OpenAI ChatGPT 5.3 PositionEstimationAgent and complete an image challenge. When the presenter clicks Capture Attendee Photos, all online attendees are prompted to take a photo and upload it to blob storage. 
The PositionEstimationAgent then analyzes the image to estimate their seating location, which can provide insights into student performance. Analysis Notes: Analyzed 13 students in 2 overlapping batches. Batch 1: The venue is a computer lab with the projector screen at the front center, whiteboards on the left, and cabinets on the right. Relative depth was estimated mainly from screen size and number of monitor rows visible ahead. Column estimates were inferred from screen angle and side-room features, with lower confidence for the rotated side-view image. Batch 2: These six photos appear to come from the same computer lab with the projector at the front center. Relative depth was estimated mainly from projector size and number of visible desk/monitor rows ahead. Left-right placement was inferred from projector skew and side-wall visibility. Within this batch, 240124734 and 240167285 seem closest to the front, 240286514 and 240158424 are slightly farther back, 240293498 is farther back again, and 240160364 appears furthest. Pass around the QR code attendance sheet Traditionally, the attendance sheet is circulated for attendees to sign, but this method is unreliable because no one monitors the signing process, allowing one attendee to sign for someone who is absent. It is also slow and not scalable for large groups. The QR Code attendance sheet functions as a chain. The presenter randomly distributes a short‑lived, one‑time QR code—representing a virtual attendance sheet—to any number of attendees, just like handing out multiple physical sheets. Each attendee must find another participant to scan their code to record attendance, continuing the chain until the final group of attendees. The presenter then verifies the last group’s presence. The first chain is a dead chain because that student left the venue and cannot find another student to scan his QR code. The second chain contains 20 student attendance records. It also provides useful insights into their friendship and seating patterns. Architecture This project is built using Vibe Coding, so we will not share highly technical details in this post. If you'd like to learn more, leave a comment, and we will write another blog to cover the specifics. GitHub Repo https://github.com/wongcyrus/ProvePresent Conclusion ProvePresent demonstrates how Azure serverless technology and Azure OpenAI can work together to solve a long‑standing problem in education: verifying genuine student presence and engagement. By combining real‑time QR code verification, SignalR‑powered live interactions, AI‑generated quizzes, and intelligent photo‑based seating analysis, we created a system where “being present” is no longer just a checkbox—it becomes a verifiable, interactive, and meaningful part of the learning experience. Instead of relying on outdated smart‑card systems or manual roll calls, educators gain a dynamic tool that keeps students attentive, provides insight into classroom behavior, and produces useful analytics for improving teaching outcomes. Students, in turn, benefit from an engaging, modern attendance experience that aligns with how digital‑native learners expect classes to operate. This is only the beginning. With Microsoft Foundry agents and the flexibility of Azure Functions, there are many opportunities to extend ProvePresent further—richer analytics, smarter engagement models, and seamless integration with LMS platforms. 
If there's interest, we're happy to share more technical details, architectural deep dives, and future roadmap ideas in a follow-up post.

Thank you to the Microsoft Student Ambassadors from the Hong Kong Institute of Information Technology (HKIIT), Wong Wing Ho, CHAN Sham Jayson, Pang Ho Shum, and Chan Ka Chun, for their contributions. They are majoring in the Higher Diploma in Cloud and Data Centre Administration.

About the Author
Cyrus Wong is a senior lecturer at the Hong Kong Institute of Information Technology (HKIIT) @ IVE (Lee Wai Lee), and he focuses on teaching public cloud technologies. He is a passionate advocate for the adoption of cloud technology across various media and events. With his extensive knowledge and expertise, he has earned prestigious recognitions such as AWS Builder Center, Microsoft MVP - Microsoft Foundry, and Google Developer Expert for Google Cloud Platform & AI.

Stop Drawing Architecture Diagrams Manually: Meet the Open-Source AI Architecture Review Agents
Designing and documenting software architecture is often a battle against static diagrams that become outdated the moment they are drawn. The Architecture Review Agent changes that by turning your design process into a dynamic, AI-powered workflow. In this post, we explore how to leverage Microsoft Foundry Hosted Agents, Azure OpenAI, and Excalidraw to build an open-source tool that instantly converts messy text descriptions, YAML, or README files into editable architecture diagrams. Beyond just drawing boxes, the agent acts as a technical co-pilot, delivering prioritized risk assessments, highlighting single points of failure, and mapping component dependencies. Discover how to eliminate manual diagramming, catch security flaws early, and deploy your own enterprise-grade agent with zero infrastructure overhead.

How to Set Up Claude Code with Microsoft Foundry Models on macOS
Introduction Building with AI isn't just about picking a smart model. It is about where that model lives. I chose to route my Claude Code setup through Microsoft Foundry because I needed more than just a raw API. I wanted the reliability, compliance, and structured management that comes with Microsoft's ecosystem. When you are moving from a prototype to something real, having that level of infrastructure backing your calls makes a significant difference. The challenge is that Foundry is designed for enterprise cloud environments, while my daily development work happens locally on a MacBook. Getting the two to communicate seamlessly involved navigating a maze of shell configurations and environment variables that weren't immediately obvious. I wrote this guide to document the exact steps for bridging that gap. Here is how you can set up Claude Code to run locally on macOS while leveraging the stability of models deployed on Microsoft Foundry. Requirements Before we open the terminal, let's make sure you have the necessary accounts and environments ready. Since we are bridging a local CLI with an enterprise cloud setup, having these credentials handy now will save you time later. Azure Subscription with Microsoft Foundry Setup - This is the most critical piece. You need an active Azure subscription where the Microsoft Foundry environment is initialized. Ensure that you have deployed the Claude model you intend to use and that the deployment status is active. You will need the specific endpoint URL and the associated API keys from this deployment to configure the connection. An Anthropic User Account - Even though the compute is happening on Azure, the interface requires an Anthropic account. You will need this to authenticate your session and manage your user profile settings within the Claude Code ecosystem. Claude Code Client on macOS - We will be running the commands locally, so you need the Claude Code CLI installed on your MacBook. Step 1: Install Claude Code on macOS The recommended installation method is via Homebrew or Curl, which sets it up for terminal access ("OS level"). Option A: Homebrew (Recommended) brew install --cask claude-code Option B: Curl curl -fsSL https://claude.ai/install.sh | bash Verify Installation: Run claude --version. Step 2: Set Up Microsoft Foundry to deploy Claude model Navigate to your Microsoft Foundry portal, and find the Claude model catalog, and deploy the selected Claude model. [Microsoft Foundry > My Assets > Models + endpoints > + Deploy Model > Deploy Base model > Search for "Claude"] In your Model Deployment dashboard, go to the deployed Claude Models and get the "Endpoints and keys". Store it somewhere safe, because we will need them to configure Claude Code later on. Configure Environment Variables in MacOS terminal: Now we need to tell your local Claude Code client to route requests through Microsoft Foundry instead of the default Anthropic endpoints. This is handled by setting specific environment variables that act as a bridge between your local machine and your Azure resources. You could run these commands manually every time you open a terminal, but it is much more efficient to save them permanently in your shell profile. For most modern macOS users, this file is .zshrc. 
Open your terminal and add the following lines to your profile, making sure to replace the placeholder text with your actual Azure credentials:
export CLAUDE_CODE_USE_FOUNDRY=1
export ANTHROPIC_FOUNDRY_API_KEY="your-azure-api-key"
export ANTHROPIC_FOUNDRY_RESOURCE="your-resource-name"
# Specify the deployment name for Opus
export CLAUDE_CODE_MODEL="your-opus-deployment-name"
Once you have added these variables, you need to reload your shell configuration for the changes to take effect. Run the source command below to update your current session, and then verify the setup by launching Claude:
source ~/.zshrc
claude
If everything is configured correctly, the Claude CLI will initialize using your Microsoft Foundry deployment as the backend. Once you execute the claude command, the CLI will prompt you to choose an authentication method. Select Option 2 (Anthropic Console account) to proceed. This action opens your default web browser and redirects you to the Claude Console. Simply sign in using your standard Anthropic account credentials. After you have successfully signed in, you will be presented with a permissions screen. Click the Authorize button to link your web session back to your local terminal. Return to your terminal window, and you should see a notification confirming that the login process is complete. Press Enter to finalize the setup. You are now fully connected. You can start using Claude Code locally, powered entirely by the model deployment running in your Microsoft Foundry environment.

Conclusion
Setting up this environment might seem like a heavy lift just to run a CLI tool, but the payoff is significant. You now have a workflow that combines the immediate feedback of local development with the security and infrastructure benefits of Microsoft Foundry. One of the most practical upgrades is the removal of standard usage caps. You are no longer limited by the 5-hour API call limits, which gives you the freedom to iterate, test, and debug for as long as your project requires without hitting a wall. By bridging your local macOS terminal to Azure, you are no longer just hitting an API endpoint. You are leveraging a managed, compliance-ready environment that scales with your needs. The best part is that once the configuration is locked in, you don't need to think about the plumbing again. You can focus entirely on coding, knowing that the reliability of an enterprise platform is running quietly in the background, supporting every command.

Building with Azure OpenAI Sora: A Complete Guide to AI Video Generation
In this comprehensive guide, we'll explore how to integrate both Sora 1 and Sora 2 models from Azure OpenAI Service into a production web application. We'll cover API integration, request body parameters, cost analysis, limitations, and the key differences between using Azure AI Foundry endpoints versus OpenAI's native API. Table of Contents Introduction to Sora Models Azure AI Foundry vs. OpenAI API Structure API Integration: Request Body Parameters Video Generation Modes Cost Analysis per Generation Technical Limitations & Constraints Resolution & Duration Support Implementation Best Practices Introduction to Sora Models Sora is OpenAI's groundbreaking text-to-video model that generates realistic videos from natural language descriptions. Azure AI Foundry provides access to two versions: Sora 1: The original model focused primarily on text-to-video generation with extensive resolution options (480p to 1080p) and flexible duration (1-20 seconds) Sora 2: The enhanced version with native audio generation, multiple generation modes (text-to-video, image-to-video, video-to-video remix), but more constrained resolution options (720p only in public preview) Azure AI Foundry vs. OpenAI API Structure Key Architectural Differences Sora 1 uses Azure's traditional deployment-based API structure: Endpoint Pattern: https://{resource-name}.openai.azure.com/openai/deployments/{deployment-name}/... Parameters: Uses Azure-specific naming like n_seconds, n_variants, separate width/height fields Job Management: Uses /jobs/{id} for status polling Content Download: Uses /video/generations/{generation_id}/content/video Sora 2 adapts OpenAI's v1 API format while still being hosted on Azure: Endpoint Pattern: https://{resource-name}.openai.azure.com/openai/deployments/{deployment-name}/videos Parameters: Uses OpenAI-style naming like seconds (string), size (combined dimension string like "1280x720") Job Management: Uses /videos/{video_id} for status polling Content Download: Uses /videos/{video_id}/content Why This Matters? 
This architectural difference requires conditional request formatting in your code: const isSora2 = deployment.toLowerCase().includes('sora-2'); if (isSora2) { requestBody = { model: deployment, prompt, size: `${width}x${height}`, // Combined format seconds: duration.toString(), // String type }; } else { requestBody = { model: deployment, prompt, height, // Separate dimensions width, n_seconds: duration.toString(), // Azure naming n_variants: variants, }; } API Integration: Request Body Parameters Sora 1 API Parameters Standard Text-to-Video Request: { "model": "sora-1", "prompt": "Wide shot of a child flying a red kite in a grassy park, golden hour sunlight, camera slowly pans upward.", "height": "720", "width": "1280", "n_seconds": "12", "n_variants": "2" } Parameter Details: model (String, Required): Your Azure deployment name prompt (String, Required): Natural language description of the video (max 32000 chars) height (String, Required): Video height in pixels width (String, Required): Video width in pixels n_seconds (String, Required): Duration (1-20 seconds) n_variants (String, Optional): Number of variations to generate (1-4, constrained by resolution) Sora 2 API Parameters Text-to-Video Request: { "model": "sora-2", "prompt": "A serene mountain landscape with cascading waterfalls, cinematic drone shot", "size": "1280x720", "seconds": "12" } Image-to-Video Request (uses FormData): const formData = new FormData(); formData.append('model', 'sora-2'); formData.append('prompt', 'Animate this image with gentle wind movement'); formData.append('size', '1280x720'); formData.append('seconds', '8'); formData.append('input_reference', imageFile); // JPEG/PNG/WebP Video-to-Video Remix Request: Endpoint: POST .../videos/{video_id}/remix Body: Only { "prompt": "your new description" } The original video's structure, motion, and framing are reused while applying the new prompt Parameter Details: model (String, Optional): Your deployment name prompt (String, Required): Video description size (String, Optional): Either "720x1280" or "1280x720" (defaults to "720x1280") seconds (String, Optional): "4", "8", or "12" (defaults to "4") input_reference (File, Optional): Reference image for image-to-video mode remix_video_id (String, URL parameter): ID of video to remix Video Generation Modes 1. Text-to-Video (Both Models) The foundational mode where you provide a text prompt describing the desired video. Implementation: const response = await fetch(endpoint, { method: 'POST', headers: { 'Content-Type': 'application/json', 'api-key': apiKey, }, body: JSON.stringify({ model: deployment, prompt: "A train journey through mountains with dramatic lighting", size: "1280x720", seconds: "12", }), }); Best Practices: Include shot type (wide, close-up, aerial) Describe subject, action, and environment Specify lighting conditions (golden hour, dramatic, soft) Add camera movement if desired (pans, tilts, tracking shots) 2. Image-to-Video (Sora 2 Only) Generate a video anchored to or starting from a reference image. Key Requirements: Supported formats: JPEG, PNG, WebP Image dimensions must exactly match the selected video resolution Our implementation automatically resizes uploaded images to match Implementation Detail: // Resize image to match video dimensions const targetWidth = parseInt(width); const targetHeight = parseInt(height); const resizedImage = await resizeImage(inputReference, targetWidth, targetHeight); // Send as multipart/form-data formData.append('input_reference', resizedImage); 3. 
Video-to-Video Remix (Sora 2 Only) Create variations of existing videos while preserving their structure and motion. Use Cases: Change weather conditions in the same scene Modify time of day while keeping camera movement Swap subjects while maintaining composition Adjust artistic style or color grading Endpoint Structure: POST {base_url}/videos/{original_video_id}/remix?api-version=2024-08-01-preview Implementation: let requestEndpoint = endpoint; if (isSora2 && remixVideoId) { const [baseUrl, queryParams] = endpoint.split('?'); const root = baseUrl.replace(/\/videos$/, ''); requestEndpoint = `${root}/videos/${remixVideoId}/remix${queryParams ? '?' + queryParams : ''}`; } Cost Analysis per Generation Sora 1 Pricing Model Base Rate: ~$0.05 per second per variant at 720p Resolution Scaling: Cost scales linearly with pixel count Formula: const basePrice = 0.05; const basePixels = 1280 * 720; // Reference resolution const currentPixels = width * height; const resolutionMultiplier = currentPixels / basePixels; const totalCost = basePrice * duration * variants * resolutionMultiplier; Examples: 720p (1280×720), 12 seconds, 1 variant: $0.60 1080p (1920×1080), 12 seconds, 1 variant: $1.35 720p, 12 seconds, 2 variants: $1.20 Sora 2 Pricing Model Flat Rate: $0.10 per second per variant (no resolution scaling in public preview) Formula: const totalCost = 0.10 * duration * variants; Examples: 720p (1280×720), 4 seconds: $0.40 720p (1280×720), 12 seconds: $1.20 720p (720×1280), 8 seconds: $0.80 Note: Since Sora 2 currently only supports 720p in public preview, resolution doesn't affect cost, only duration matters. Cost Comparison Scenario Sora 1 (720p) Sora 2 (720p) Winner 4s video $0.20 $0.40 Sora 1 12s video $0.60 $1.20 Sora 1 12s + audio N/A (no audio) $1.20 Sora 2 (unique) Image-to-video N/A $0.40-$1.20 Sora 2 (unique) Recommendation: Use Sora 1 for cost-effective silent videos at various resolutions. Use Sora 2 when you need audio, image/video inputs, or remix capabilities. Technical Limitations & Constraints Sora 1 Limitations Resolution Options: 9 supported resolutions from 480×480 to 1920×1080 Includes square, portrait, and landscape formats Full list: 480×480, 480×854, 854×480, 720×720, 720×1280, 1280×720, 1080×1080, 1080×1920, 1920×1080 Duration: Flexible: 1 to 20 seconds Any integer value within range Variants: Depends on resolution: 1080p: Variants disabled (n_variants must be 1) 720p: Max 2 variants Other resolutions: Max 4 variants Concurrent Jobs: Maximum 2 jobs running simultaneously Job Expiration: Videos expire 24 hours after generation Audio: No audio generation (silent videos only) Sora 2 Limitations Resolution Options (Public Preview): Only 2 options: 720×1280 (portrait) or 1280×720 (landscape) No square formats No 1080p support in current preview Duration: Fixed options only: 4, 8, or 12 seconds No custom durations Defaults to 4 seconds if not specified Variants: Not prominently supported in current API documentation Focus is on single high-quality generations with audio Concurrent Jobs: Maximum 2 jobs (same as Sora 1) Job Expiration: 24 hours (same as Sora 1) Audio: Native audio generation included (dialogue, sound effects, ambience) Shared Constraints Concurrent Processing: Both models enforce a limit of 2 concurrent video jobs per Azure resource. You must wait for one job to complete before starting a third. Job Lifecycle: queued → preprocessing → processing/running → completed Download Window: Videos are available for 24 hours after completion. 
After expiration, you must regenerate the video. Generation Time: Typical: 1-5 minutes depending on resolution, duration, and API load Can occasionally take longer during high demand Resolution & Duration Support Matrix Sora 1 Support Matrix Resolution Aspect Ratio Max Variants Duration Range Use Case 480×480 Square 4 1-20s Social thumbnails 480×854 Portrait 4 1-20s Mobile stories 854×480 Landscape 4 1-20s Quick previews 720×720 Square 4 1-20s Instagram posts 720×1280 Portrait 2 1-20s TikTok/Reels 1280×720 Landscape 2 1-20s YouTube shorts 1080×1080 Square 1 1-20s Premium social 1080×1920 Portrait 1 1-20s Premium vertical 1920×1080 Landscape 1 1-20s Full HD content Sora 2 Support Matrix Resolution Aspect Ratio Duration Options Audio Generation Modes 720×1280 Portrait 4s, 8s, 12s ✅ Yes Text, Image, Video Remix 1280×720 Landscape 4s, 8s, 12s ✅ Yes Text, Image, Video Remix Note: Sora 2's limited resolution options in public preview are expected to expand in future releases. Implementation Best Practices 1. Job Status Polling Strategy Implement adaptive backoff to avoid overwhelming the API: const maxAttempts = 180; // 15 minutes max let attempts = 0; const baseDelayMs = 3000; // Start with 3 seconds while (attempts < maxAttempts) { const response = await fetch(statusUrl, { headers: { 'api-key': apiKey }, }); if (response.status === 404) { // Job not ready yet, wait longer const delayMs = Math.min(15000, baseDelayMs + attempts * 1000); await new Promise(r => setTimeout(r, delayMs)); attempts++; continue; } const job = await response.json(); // Check completion (different status values for Sora 1 vs 2) const isCompleted = isSora2 ? job.status === 'completed' : job.status === 'succeeded'; if (isCompleted) break; // Adaptive backoff const delayMs = Math.min(15000, baseDelayMs + attempts * 1000); await new Promise(r => setTimeout(r, delayMs)); attempts++; } 2. Handling Different Response Structures Sora 1 Video Download: const generations = Array.isArray(job.generations) ? job.generations : []; const genId = generations[0]?.id; const videoUrl = `${root}/${genId}/content/video`; Sora 2 Video Download: const videoUrl = `${root}/videos/${jobId}/content`; 3. Error Handling try { const response = await fetch(endpoint, fetchOptions); if (!response.ok) { const error = await response.text(); throw new Error(`Video generation failed: ${error}`); } // ... handle successful response } catch (error) { console.error('[VideoGen] Error:', error); // Implement retry logic or user notification } 4. Image Preprocessing for Image-to-Video Always resize images to match the target video resolution: async function resizeImage(file: File, targetWidth: number, targetHeight: number): Promise<File> { return new Promise((resolve, reject) => { const img = new Image(); const canvas = document.createElement('canvas'); const ctx = canvas.getContext('2d'); img.onload = () => { canvas.width = targetWidth; canvas.height = targetHeight; ctx.drawImage(img, 0, 0, targetWidth, targetHeight); canvas.toBlob((blob) => { if (blob) { const resizedFile = new File([blob], file.name, { type: file.type }); resolve(resizedFile); } else { reject(new Error('Failed to create resized image blob')); } }, file.type); }; img.onerror = () => reject(new Error('Failed to load image')); img.src = URL.createObjectURL(file); }); } 5. 
Cost Tracking
Implement cost estimation before generation and tracking after:
// Pre-generation estimate
const estimatedCost = calculateCost(width, height, duration, variants, soraVersion);
// Save generation record
await saveGenerationRecord({
  prompt,
  soraModel: soraVersion,
  duration: parseInt(duration),
  resolution: `${width}x${height}`,
  variants: parseInt(variants),
  generationMode: mode,
  estimatedCost,
  status: 'queued',
  jobId: job.id,
});
// Update after completion
await updateGenerationStatus(jobId, 'completed', { videoId: finalVideoId });

6. Progressive User Feedback
Provide detailed status updates during the generation process:
const statusMessages: Record<string, string> = {
  'preprocessing': 'Preprocessing your request...',
  'running': 'Generating video...',
  'processing': 'Processing video...',
  'queued': 'Job queued...',
  'in_progress': 'Generating video...',
};
onProgress?.(statusMessages[job.status] || `Status: ${job.status}`);

Conclusion
Building with Azure OpenAI's Sora models requires understanding the nuanced differences between Sora 1 and Sora 2, both in API structure and capabilities. Key takeaways:
Choose the right model: Sora 1 for resolution flexibility and cost-effectiveness; Sora 2 for audio, image inputs, and remix capabilities
Handle API differences: Implement conditional logic for parameter formatting and status polling based on model version
Respect limitations: Plan around concurrent job limits, resolution constraints, and 24-hour expiration windows
Optimize costs: Calculate estimates upfront and track actual usage for better budget management
Provide great UX: Implement adaptive polling, progressive status updates, and clear error messages
The future of AI video generation is exciting, and Azure AI Foundry provides production-ready access to these powerful models. As Sora 2 matures and limitations are lifted (especially resolution options), we'll see even more creative applications emerge.
Resources:
Azure AI Foundry Sora Documentation
OpenAI Sora API Reference
Azure OpenAI Service Pricing
This blog post is based on real-world implementation experience building LemonGrab, my AI video generation platform that integrates both Sora 1 and Sora 2 through Azure AI Foundry. The code examples are extracted from production usage.

Microsoft Foundry - Everything you need to build AI apps & agents
Our unified, interoperable AI platform enables developers to build faster and smarter, while organizations gain fleetwide security and governance in a unified portal. Yina Arenas, Microsoft Foundry CVP, shares how to keep your development and operations teams coordinated, ensuring productivity, governance, and visibility across all your AI projects. Learn more in this Microsoft Mechanics demo, and start building with Microsoft Foundry at ai.azure.com Feed your agents multiple trusted data sources. For accurate, contextual responses, get started with Microsoft Foundry. Start here. Apply safety & security guardrails. Ensure responsible AI behavior. Check it out. Keep your AI apps running smoothly. Deploy agents to Teams and Copilot Chat, then monitor performance and costs in Microsoft Foundry. See how it works. QUICK LINKS: 00:54 — Tour the Microsoft Foundry portal 03:32 — The Build tab and Workflows 05:03 — How to build an agentic app 07:02 — Evaluate agent performance 08:37 — Safety and security 09:18 — Publish your agentic app 09:41 — Post deployment 11:36 — Wrap up Link References Visit https://ai.azure.com and get started today Unfamiliar with Microsoft Mechanics? As Microsoft’s official video series for IT, you can watch and share valuable content and demos of current and upcoming tech from the people who build it at Microsoft. Subscribe to our YouTube: https://www.youtube.com/c/MicrosoftMechanicsSeries Talk with other IT Pros, join us on the Microsoft Tech Community: https://techcommunity.microsoft.com/t5/microsoft-mechanics-blog/bg-p/MicrosoftMechanicsBlog Watch or listen from anywhere, subscribe to our podcast: https://microsoftmechanics.libsyn.com/podcast Keep getting this insider knowledge, join us on social: Follow us on Twitter: https://twitter.com/MSFTMechanics Share knowledge on LinkedIn: https://www.linkedin.com/company/microsoft-mechanics/ Enjoy us on Instagram: https://www.instagram.com/msftmechanics/ Loosen up with us on TikTok: https://www.tiktok.com/@msftmechanics Video Transcript: -If you are building AI apps and agents and want to move faster with more control, the newly expounded Foundry helps you do exactly that, while integrating directly with your code. It works like a unified AI app and agent factory, with rich tooling and observability. A simple developer experience helps you and your team find the right components you need to start building your agents and move seamlessly from idea all the way to production. It is augmented by powerful new capabilities, such as an agent framework for multi-agentic apps and workflow automation, or multisource knowledge-based creation to support deep reasoning. New levels of observability across your fleet of agents then help you evaluate how well they’re operating. And it is easier than ever to ensure security and safety controls are in place to support the right level of trust and much more. -Let’s tour the new Microsoft Foundry portal while we build an agentic app. We’ll play the role of a clothing company using AI to research new market opportunities. The homepage at ai.azure.com guides you right through a build experience. It’s simple to start building, to create an agent, design a workflow, and browse available AI models right from here. Alternatively, you can quickly copy the project endpoint, the key, and the region to use it directly in your code with the Microsoft Foundry SDK. One of the most notable improvements is how everything you need to do is aligned to the development lifecycle. 
-If you are just getting started, the Discovery tab makes it simple to find everything you need. Feature models are front and center, from OpenAI, Grok, Meta, DeepSeek, Mistral AI, and now for the first time, Anthropic. You can also browse model collections, including models that you can run from your local device from Foundry Local. Model Leaderboard then helps you reference how the top models compare across quality, safety, throughput, and cost. And you’ll see the feature tools, including MCP servers, that you can connect to. Then moving to the left nav, in Agents, you can find samples for different standalone agent types to quickly get you up and running. -In Models, you can browse a massive industry-leading catalog of thousand of foundational open source and specialized models. Click any model to see its capabilities, like this one for GPT-5 Chat. Then clicking into Deploy, we can try it out from here. I’ll add a prompt: “What is a must-have apparel for the fall in the Pacific Northwest?” Now, looking at its generated response with recommendations for outerwear, it looks like GPT-5 Chat knows that it rains quite a bit here. If I move back to the catalog view, we can also see the new model router that automatically routes prompts to the most efficient models in real time, ensuring high-quality results while minimizing costs. I already have it deployed here and ready to use. -Under Tools, you’ll find all of the available tools that you can use to connect your agents and apps. You can easily find MCP servers and more than a thousand connectors to add to your workflows. You can add them from here or right as you’re building your agent. Next, to accelerate your efforts, you can access dozens of curated solution templates with step-by-step instructions for coding AI right into your apps. These are customizable code samples with preintegrated Azure services and GitHub-hosted quickstart guides for different app types. So there are plenty of components to discover while designing your agent. -Next, the Build tab brings powerful new capabilities, whether you’re creating a single agent or a multi-agentic solution. Build is where you manage the assets you own: agents, workflows, models, tools, knowledge and more. And straightaway it’s easy to get to all your current agents or create new ones. I have a few here already that I’ll be calling later to support our multi-agentic app, including this research agent. In Workflows, you can create and see all your multi-agentic apps and workflow automations. -To get started, you can pick from different topologies such as Sequential, Human in the Loop, or Group Chat and more. I have a few here, including this one for research that we’ll use in our agentic app. We’ll go deeper on this in just a moment. As you continue building your app, your deployed models can be viewed in context. Here’s the model router that we saw before. And then further down the left rail you’ll find fine-tuning options where you can customize model behavior and outputs using supervised learning, direct preference optimization, and reinforcement techniques. Under the Tools, it’s easy to see which ones are already connected to your environment. Knowledge then allows you to add knowledge bases from Foundry IQ so you can bring not just one but multiple sources, including SharePoint online, OneLake, which is part of Microsoft Fabric, and your search index to ground your agents. -And in Data, you can create synthetic datasets, which are very handy for fine-tuning and evaluation. 
Now that we have the foundational ingredients for our agentic app collected, let's actually build it. I'll start with a multi-agent workflow that my team is working on. Workflows are also a type of agent with similar constructs for development, deployment, and management, and they can contain their own logic as well as other agents. The visualizer lets you easily define and view the nodes in the workflow, as well as all connected agents. You can apply conditions like this to a workflow step. Here we're assessing the competitiveness of the insights generated as we research opportunities for market expansion.

-There is also a go-to loop: if the insights are not competitive, we'll iterate on this step. For many of these connectors, you can add agents. I'm going to add an existing agent after the procurement researcher. I'll choose an agent that we've already started working on, the research agent, and jump into the editor. Note that the Playground tab is the starting point for all agents that you create. You can choose the model you want. I'll choose GPT-5 Chat and then provide the agent with instructions. I'll add mine here with high-level details for what the agent should do. Below that, in Tools, you can see that my research agent is already connected to our internal SharePoint site in Microsoft 365. I can also add knowledge bases to ground responses right from here. I can turn on memory for my agent to retain notable context and apply guardrails for safety and security controls. I'll show you more on that later. Agents are also multimodal, including voice, which is great for mobile apps. Using voice, I'll prompt it with: "What industry is Zava Corp in, and what goods does it produce?"

-[AI] Zava Corporation operates in the apparel industry. It focuses on producing a wide range of clothing and fashion-related goods.

-Next, I'll type in a text prompt, and that will retrieve content from our SharePoint site to generate its response. And importantly, as I make these changes to my agent, it will now automatically version them, and I can always revert to a previous version. Then as the build phase continues, it's easy to evaluate agent performance.

-In Evaluations, I can see all my agent runs. I've already started creating an evaluation for our agent using synthetic data to check that we are hitting our goals for output quality and safety. From the agent, we can review its runs and traces to diagnose latency bottlenecks. And under the Evaluation tab, you can see that our AI quality and safety scores could be better. Using these insights, let's update our agent and make improvements. Everything shown in the web portal can also be done with code. So let's do this update in VS Code. This is the same multi-agentic workflow I showed you before, with all of its logic now represented in code. The folders on the left rail represent our different agents, and the workflow structure describes the multi-agent reasoning process. It's designed to take incoming requests and route them to the relevant expert agent to complete the tasks. We have an intent classifier agent, a procurement researcher, the market researcher that we just built, and two more with expertise in negotiation and review.

-And the workflow is connected to a knowledge base with multiple sources to inform agentic responses. This includes a search index for supplier information, relevant financial data from Microsoft Fabric, product data from SharePoint, and we can connect to available MCP servers like this one from GitHub.
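The intent-classifier-plus-expert-agents shape just described can be pictured with a deliberately plain sketch. This is not the workflow definition from the demo and does not use the Foundry workflow or agent APIs; it only illustrates the routing idea, with invented agent names and a keyword stand-in where the real workflow uses an LLM-based classifier.

```csharp
// Illustrative only: classify a request's intent, then route it to an expert agent.
// Agent names and instructions are placeholders, and ClassifyIntent is a trivial
// stand-in for the intent-classifier agent in the real workflow.
using System;
using System.Collections.Generic;

record ExpertAgent(string Name, string Instructions);

class RoutingSketch
{
    static readonly Dictionary<string, ExpertAgent> Experts = new()
    {
        ["procurement"] = new("procurement-researcher", "Research suppliers and costs."),
        ["market"]      = new("market-researcher",      "Research market expansion opportunities."),
        ["negotiation"] = new("negotiation-expert",     "Draft negotiation positions."),
        ["review"]      = new("review-expert",          "Review and critique generated insights."),
    };

    static string ClassifyIntent(string request)
    {
        // In the real workflow, an intent-classifier agent (an LLM call) performs this step.
        if (request.Contains("supplier", StringComparison.OrdinalIgnoreCase)) return "procurement";
        if (request.Contains("market",   StringComparison.OrdinalIgnoreCase)) return "market";
        return "review";
    }

    static void Main()
    {
        string request = "Which markets should we expand our fall apparel line into?";
        ExpertAgent target = Experts[ClassifyIntent(request)];
        Console.WriteLine($"Routing request to {target.Name}: {target.Instructions}");
        // The chosen agent would then run with its own instructions, tools, and knowledge.
    }
}
```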
Having this rich multisource knowledge base feeding our agentic workflow should ensure more accurate results. In fact, if we look at the evaluation for this workflow, you will see that AI quality is a lot higher overall. But we still have to do some work on safety. We'll address this by adding the right safety and security controls right from Microsoft Foundry. For that, we'll head over to Guardrails, where you can apply controls based on specific AI risks.

-I'll target jailbreak attacks, and then I can apply additional associated controls like content safety and protected materials to ensure our agents also behave responsibly. And I can scope what this guardrail should govern: either a model or an agent; or in my case, I'll select our workflow to address the low safety score that we saw earlier. And with that, it's ready to publish. In fact, we've made it easier to get your apps and agents into the productivity tools that people use every day. I can publish our agentic app directly into Microsoft Teams and Copilot Chat right from our workflow. And once it is approved by the Microsoft 365 admin, business users can find it in the Agent Store and pin it for easy access. Now, with everything in production, your developer and operations teams can continue working together in Microsoft Foundry, post-deployment and beyond.

-The Operate tab has the full Foundry control plane. In the overview, you can quickly monitor key operational metrics and spot what needs your attention. This is a full cross-fleet view of your agents. You can also filter by subscription and then by project if you want. The top active alerts are listed right here for me to take action. And I can optionally view all alerts if I want, along with rollout metrics for estimated cost, agent success rates, and total token usage. Below that, we can see the details of agent runs over time, along with top- and bottom-performing agents with trends for each. All performance data is built on OpenTelemetry standards and can be easily surfaced inside Azure Monitor or your favorite reporting tool.

-Next, under Assets, for every agent, model, and tool in your environment, you can see metrics like status, error rates, estimated cost, token usage, and number of runs. This gives you a quick pulse on performance activity and health for each asset. And you can click in for more details if you want to. Compliance then lets IT teams view and set default policies by AI risk for any asset created. You can add controls and then scope them to the entire subscription or resource group. That way they will automatically inherit governance controls. Under Quota, you can keep all of your costs in check while ensuring that your AI applications and agents stay within your token limits. And finally, under Admin, you can find all of your resources and related configuration controls for each project in one place, and click in to manage roles and access. If you go back, the newly integrated AI gateways also allow you to connect and manage agents, even from other clouds.

-So that's how the expanded Microsoft Foundry simplifies the development and operations experience to help you and your team build powerful AI apps and agents faster, with more control, while integrated directly into your code. Visit ai.azure.com to learn more and get started today. Keep watching Microsoft Mechanics for the latest tech updates, and subscribe if you haven't already. Thanks for watching.

Foundry IQ for Multi-Source AI Knowledge Bases
Pull from multiple sources at once, connect the dots automatically, and get accurate, context-rich answers without doing manual orchestration with Foundry IQ in Microsoft Foundry. Navigate complex, distributed data across Azure stores, SharePoint, OneLake, MCP servers, and even the web, all through a single knowledge base that handles query planning and iteration for you. Reuse the Azure AI Search assets you already have, build new knowledge bases with minimal setup, and control how much reasoning effort your agents apply. As you develop, you can rely on iterative retrieval only when it improves results, saving time and tokens and reducing development complexity. Pablo Castro, Azure AI Search CVP and Distinguished Engineer, joins Jeremy Chapman to share how to build smarter, more capable AI agents, with higher-quality grounded answers and less engineering overhead.

Smart, accurate responses. Give your agents the ability to search across multiple sources automatically without extra development work. Check out Foundry IQ in Microsoft Foundry.

Build AI agents fast. Organize your data, handle query planning, and orchestrate retrieval automatically. Get started using Foundry IQ knowledge bases.

Save time and resources while keeping answers accurate. Foundry IQ decides when to iterate or exit, optimizing efficiency. Take a look.

QUICK LINKS:
00:00 — Foundry IQ in Microsoft Foundry
01:02 — How it's evolved
03:02 — Knowledge bases in Foundry IQ
04:37 — Azure AI Search and retrieval stack
05:51 — How it works
06:52 — Visualization tool demo
08:07 — Build a knowledge base
10:10 — Evaluating results
13:11 — Wrap up

Link References
To learn more check out https://aka.ms/FoundryIQ
For more details on the evaluation metric discussed on this show, read our blog at https://aka.ms/kb-evals
For more on Microsoft Foundry go to https://ai.azure.com/nextgen

Video Transcript:

- If you research any topic, do you stop after one knowledge source? That's how most AI will typically work today to generate responses. Instead, now with Foundry IQ in Microsoft Foundry, built-in AI-powered query decomposition and orchestration make it easy for your agents to find and retrieve the right information across multiple sources, autonomously iterating as much as required to generate smarter and more relevant responses than previously possible. And the good news is, as a developer, this all just works out of the box. And joining me to unpack everything and also show a few demonstrations of how it works is Pablo Castro, distinguished engineer and also CVP. He's also the architect of Azure AI Search. So welcome back to the show.

- It's great to be back.
- And you've been at the forefront of AI knowledge retrieval really since the beginning, where Azure AI Search is Microsoft's state-of-the-art search engine for vector and hybrid retrieval, and this is really key to building out things like RAG-based agentic services and applications. So how have things evolved since then?

- Things are changing really fast. Now, AI, and agents in particular, are expected to navigate the reality of enterprise information. They need to pull data across multiple sources and connect the dots as they automate tasks. This data is all over the place, some in Azure stores, some in SharePoint, some is public data on the web, anywhere you can think of. Up until now, AI applications that needed to ground agents on external knowledge typically used a single index. If they needed to use multiple data sources, it was up to the developer to orchestrate them. With Foundry IQ and the underlying Azure AI Search retrieval stack, we tackled this whole problem. Let me show you. Here is a technician support agent that I built. It's pointed at a knowledge base with information from different sources that we pull together in Foundry IQ. It provides our agent with everything it needs to know as it provides support to onsite technicians. Let's try it. I'll ask a really convoluted question, more of a stream of thought that someone might ask when working on a problem. I'll paste in: "Equipment not working, CTL11 light is red, maybe power supply problem? Label on equipment says P4324. The cord has another label UL 817. Okay to replace the part?" From here, the agent will give the question to the knowledge base, and the knowledge base will figure out which knowledge sources to consult before coming back with a comprehensive answer. So how did it answer this particular question? Well, we can see it went across three different data sources. The functionality of the CTL11 indicator is from the machine manuals. We received them from different machine vendors, and we have them all stored in OneLake. Then, the company policy for repairs, which our company regularly edits, lives in SharePoint. And finally, the agent retrieved public information from the web to determine electrical standards.

- And really, the secret sauce behind all of this is the knowledge base. So can you explain what that is and how that works?

- So yeah, knowledge bases are first-class artifacts in Foundry IQ. Think of a knowledge base as the encapsulation of an information domain, such as technical support in our example. A knowledge base comprises one or more data sources that can live anywhere. And it has its own AI models for retrieval orchestration against those sources. When a query comes in, a planning step is run. Here, the query is deconstructed. The AI model refers to the source description or retrieval instructions provided, and it connects the different parts of the query to the appropriate knowledge source. It then runs the queries, and it looks at the results. A fast, fine-tuned SLM then assesses whether we have enough information to exit or if we need more information and should iterate by running the planning step again. Once it has a high level of confidence in the response, it'll return the results to the agent along with the source information for citations. Let's open the knowledge base for our technician support agent. And at the bottom, you can see our three different knowledge sources. Again, machine specs pulls markdown files from OneLake with all the equipment manuals.
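Before looking at each knowledge source in detail, the plan, retrieve, assess, and iterate-or-exit flow just described can be summarized in a schematic sketch. This is purely conceptual; Foundry IQ runs this loop inside the service, and the method names below are invented for illustration rather than taken from any SDK.

```csharp
// Conceptual sketch only: the plan -> retrieve -> assess -> iterate/exit loop that the
// knowledge base runs on the service side. Not Foundry IQ's implementation.
using System;
using System.Collections.Generic;

class AgenticRetrievalLoopSketch
{
    static List<string> PlanQueries(string question, IReadOnlyList<string> sources)
    {
        // Real planning is an LLM step that decomposes the question per knowledge source.
        var queries = new List<string>();
        foreach (var source in sources) queries.Add($"[{source}] {question}");
        return queries;
    }

    static List<string> RunQueries(IEnumerable<string> queries) =>
        new() { "grounding passage 1" }; // real retrieval hits indexes and remote sources

    static bool EnoughInformation(List<string> results) =>
        results.Count > 0; // real check: a fast fine-tuned SLM first, then an LLM if in doubt

    static void Main()
    {
        var sources = new[] { "machine-specs (OneLake)", "policies (SharePoint)", "web" };
        var question = "CTL11 light is red; is it OK to replace part P4324?";
        var gathered = new List<string>();

        for (int iteration = 0; iteration < 3 && !EnoughInformation(gathered); iteration++)
        {
            var queries = PlanQueries(question, sources); // query decomposition
            gathered.AddRange(RunQueries(queries));       // fan out across sources
        }

        Console.WriteLine($"Returning {gathered.Count} grounding result(s) with citations.");
    }
}
```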
And notice the source description, which Foundry IQ uses during query planning. Policies points at our SharePoint site with our company repair policies. And here's the web source for public information. And above, I've also provided retrieval instructions in natural language. Here, for example, I explicitly call out using the web for electrical and industry standards.

- And you're in Microsoft Foundry, but you also mentioned that Azure AI Search and the retrieval stack are really the underpinnings for Foundry IQ. So, what if I already have some Azure AI Search running in my case?

- Sure. Knowledge bases are actually Azure AI Search artifacts. You can still use standalone AI Search and access these capabilities. Let me show you what it looks like in the Azure portal and in code. Here, I'm in my Azure AI Search service. We can see existing knowledge bases, and here's the knowledge base we were using in Foundry IQ. Flipping to VS Code, we have a new KnowledgeBaseRetrievalClient. And if you've used Azure AI Search before, this is similar to the existing search client but focused on the agentic retrieval functionality. Let me run the retrieve step. The retrieve method takes a set of queries or a list of messages from a conversation and returns a response along with references. And here are the results in detail, this time purely using the Azure AI Search API. If you're already using Azure AI Search, you can create knowledge bases in your existing services and even reuse your existing indexes. Layering things this way lets us deliver the state-of-the-art retrieval quality that Azure AI Search is known for, combined with the power of knowledge bases and agentic retrieval.

- Now that we understand some of the core concepts behind knowledge bases, how does it actually work then under the covers?

- Well, unlike the classic RAG technique, where we typically use one source with one index, here we can use one or more indexes as well as remote sources. When you construct a knowledge base, passive data sources, such as files in OneLake or Azure Blob Storage, are indexed, meaning that Azure AI Search creates vector and keyword indexes by ingesting and processing the data from the source. We also give you the option to create indexes for specific SharePoint sites that you define while propagating permissions and labels. On the other hand, data sources like the web or MCP servers are accessed remotely, and we support remote access mode for SharePoint too. In these cases, we effectively use the connected source's own index for retrieval. Surrounding those knowledge sources, we have an agentic retrieval engine powered by an ensemble of models to run the end-to-end query process that is used to find information. I wrote a small visualization tool to show you what's going on during the retrieval process. Let me show you. I'll paste the same query we used before and just hit run. This uses the Azure AI Search knowledge base API directly to run retrieval and return both the results and details of each step. Now in the returned result, we can see it did two iterations and issued 15 queries total across three knowledge sources. This is work a person would've had to do manually while researching. In this first iteration, we can see it broke the question apart into three aspects: equipment details, the meaning of the label, and the associated policy, and it ran those three as queries against a selected set of knowledge sources.
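For the code path mentioned above, here is a heavily hedged sketch of what calling the new client might look like. Only the KnowledgeBaseRetrievalClient name comes from the demo; the namespace, constructor parameters, and method shape below are assumptions for illustration, so check the current Azure AI Search preview SDK for the real signatures.

```csharp
// Hedged sketch: the demo describes a retrieve call that takes queries or conversation
// messages and returns a response plus references. Everything below other than the
// client name is assumed; treat it as pseudocode for the shape, not the actual API.
using System;
using Azure;                         // AzureKeyCredential
using Azure.Search.Documents.Agents; // assumed namespace for the preview client

class KnowledgeBaseRetrievalSketch
{
    static void Main()
    {
        var client = new KnowledgeBaseRetrievalClient(          // name from the demo; constructor assumed
            endpoint: new Uri("https://<search-service>.search.windows.net"),
            knowledgeBaseName: "technician-support",
            credential: new AzureKeyCredential("<api-key>"));

        var response = client.Retrieve(                          // method shape assumed
            messages: new[]
            {
                "Equipment not working, CTL11 light is red, maybe power supply problem? " +
                "Label says P4324, cord label UL 817. OK to replace the part?"
            });

        // The returned object carries the synthesized answer plus source references for citations.
        Console.WriteLine(response);
    }
}
```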
Then, the retrieval engine assessed that some information was missing, so it iterated and issued a second round of searches to complete the picture. Finally, we can see a summary of how much effort we put in, in tokens, along with an answer synthesis step, where it provided a complete answer along with references. And at the bottom, we can see all the reference data used to produce the answer was also returned. This is all very powerful, because as a developer, you just need to create a knowledge base with the data sources you need, connect your agent to it, and Foundry IQ takes care of the rest.

- So, how easy is it then to build out a knowledge base like this?

- This is something we've worked really hard on to reduce the complexity. We built a powerful and simplified experience in Foundry. Starting in the Foundry portal, I'll go to Build, then to Knowledge in the left nav, and see all the knowledge bases I already created. Just to show you the options, I'll create a new one. Here, you can choose from different knowledge sources. In this case, I'll cancel out of this and create a new one from scratch. We'll give it a name, say repairs, choose a model that's used for planning and synthesis, and define the retrieval reasoning effort. This allows you to control the time and effort the system will put into information retrieval, from minimum, where we just retrieve from all the sources without planning, to higher levels of effort, where we'll do multiple iterations assessing whether we got the right results. Next, I'll set the output mode to answer synthesis, which tells the knowledge base to take the grounding information it's collected and compose a consolidated answer. Then I can add the knowledge sources we created earlier; for example, I'll reuse the machine specs source that contains the manuals in OneLake and our policies from SharePoint. If I want to create a new knowledge source, I can choose supported stores in this list. For example, if I choose blob storage, I just need to point at the storage account and container, and Foundry IQ will pull all the documents and handle the chunking, vectorization, and everything needed to make it ready to use. We'll leave things as is for now. Instead, something really cool is how we also support MCP servers as knowledge sources. Let's create a quick one. Let's say we want to pull software issues from GitHub. All I need to do is point it to the GitHub MCP server address and set search_issues as the tool name. At this point, I'm all set, and I just need to save my changes. If data needs to be indexed for some of my knowledge sources, that will happen in the background, and indexes are continually updated with fresh information.

- And to be clear, this is hiding a ton of complexity, but how do we know it's actually working better than previous approaches to retrieval?

- Well, as usual, we've done a ton of work on evaluations. First, we measured whether the agentic approach is better than just searching all the sources and combining the results. In this study, the grey lines represent the various data sets we used in this evaluation, and when using query planning and iterative search, we saw an average 36% gain in answer score, as represented by this green line. We also tested how effective it is to combine multiple private knowledge sources, and also a mix of private sources with web search, where public data can fill in the gaps when internal information falls short.
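The knowledge base configuration walked through a moment ago (name, planning model, reasoning effort, output mode, and sources, including an MCP server with a named tool) can be captured in a simple sketch. The records below are not an API payload or schema; the field names are invented purely to summarize the portal choices.

```csharp
// Not an API payload: a plain C# record capturing the choices made in the portal
// walkthrough above. Field names are invented for illustration; the portal and SDK
// define their own schema.
record KnowledgeSource(string Name, string Kind, string Location, string? ToolName = null);

record KnowledgeBaseConfig(
    string Name,
    string PlanningAndSynthesisModel,
    string RetrievalReasoningEffort,   // e.g. minimal, low, medium
    string OutputMode,                 // e.g. answer synthesis
    KnowledgeSource[] Sources);

class KnowledgeBaseConfigSketch
{
    static void Main()
    {
        var config = new KnowledgeBaseConfig(
            Name: "repairs",
            PlanningAndSynthesisModel: "<model-deployment>",   // placeholder
            RetrievalReasoningEffort: "medium",
            OutputMode: "answer-synthesis",
            Sources: new[]
            {
                new KnowledgeSource("machine-specs", "onelake",    "<lakehouse path>"),
                new KnowledgeSource("policies",      "sharepoint", "<site url>"),
                new KnowledgeSource("github-issues", "mcp",        "<GitHub MCP server address>", ToolName: "search_issues"),
            });

        System.Console.WriteLine($"{config.Name}: {config.Sources.Length} knowledge sources configured.");
    }
}
```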
We first spread information across nine knowledge sources and measured the answer score, which landed at 90%, showing just how effective multi-source retrieval is. We then removed three of the nine sources, and as expected, the answer score dropped to about 50%. Then, we added a web knowledge source to compensate for where our six internal sources were lacking, which in this case was publicly available information, and that boosted results significantly. We achieved a 24-point increase for low retrieval reasoning effort and 34 points for medium effort. Finally, we wanted to make sure we only iterate if it'll make things better. Otherwise, we want to exit the agentic retrieval loop. Again, under the covers, Foundry IQ uses two models to check whether we should exit: a fine-tuned SLM to do a fast check with a high bar, and if there is doubt, then we'll use a full LLM to reassess the situation. In this table, on the left, we can see the various data sets used in our evaluation along with the type of knowledge source we used. The fast check and the full check columns indicate the number of times, as a percentage, that each of the models decided that we should exit the agentic retrieval loop. We need to know if it was a good idea to actually exit. So the last column has the answer score you would get if you used the minimal retrieval effort setting, where there is no iteration or query planning. If this score is high, iteration isn't needed, and if it's low, iteration could have improved the answer score. You can see, for example, in the first row, the answer score is great without iteration. Both fast and full checks show a high percentage of exits. In each of these, we saved time and tokens. The middle three rows are cases where the fast check defers to the full check, and the full check predicts that we should exit at reasonably high percentages, which is consistent with the relatively high answer scores for minimal effort. Finally, the last two rows show both models wanting to iterate again most of the time, consistent with the low answer score you would've seen without iteration. So as you saw, the exit assessment approach in Foundry IQ orchestration is effective, saving time and tokens while ensuring high-quality results.

- Foundry IQ, then, is great for connecting the dots across scattered information while keeping your agents simple to build, and there's no orchestration required. It's all done for you. So, how can people try Foundry IQ for themselves right now?

- It's available now in public preview. You can check it out at aka.ms/FoundryIQ.

- Thanks so much again for joining us today, Pablo, and thank you for watching. Be sure to subscribe to Microsoft Mechanics for more updates, and we'll see you again soon.

Run local AI on any PC or Mac — Microsoft Foundry Local
Leverage full hardware performance, keep data private, reduce latency, and predict costs, even in offline or low-connectivity scenarios. Simplify development and deploy AI apps across diverse hardware and OS platforms with the Foundry Local SDK. Manage models locally, switch AI engines easily, and deliver consistent, multi-modal experiences, voice or text, without complex cross-platform setup. Raji Rajagopalan, Microsoft CoreAI Vice President, shares how to start quickly, test locally, and scale confidently. No cloud needed.

Build AI apps once and run them locally on Windows, macOS, and mobile. Get started with the Foundry Local SDK.

Lower latency, data privacy, and cost predictability. All in the box with Foundry Local. Start here.

Build once, deploy everywhere. Foundry Local ensures your AI app works on Intel, AMD, Qualcomm, and NVIDIA devices. See how it works.

QUICK LINKS:
00:00 — Run AI locally
01:48 — Local AI use cases
02:23 — App portability
03:18 — Run apps on any device
05:14 — Run on older devices
05:58 — Run apps on macOS
06:18 — Local AI is multi-modal
07:25 — How it works
08:20 — How to get it running on your device
09:26 — Start with AI Toolkit in VS Code with new SDK
10:11 — Wrap up

Link References
Check out https://aka.ms/foundrylocalSDK
Build an app using code in our repo at https://aka.ms/foundrylocalsamples

Video Transcript:

- If you want to build apps with powerful AI optimized to run locally across different PC configurations, in addition to macOS and mobile platforms, while taking advantage of bare-metal performance, where your same app can run without modification or relying on the cloud, Foundry Local with the new SDK is the way to go. Today, we'll dig deeper into how it works and how you can use it as a developer. I'm joined today by Raji Rajagopalan, who leads the Foundry Local team at Microsoft. Welcome.

- I'm very excited to be here, Jeremy. Thanks for having me.

- And thanks so much for joining us today, especially given how quickly things are moving in this space. You know, the idea of running AI locally has really shifted from exploration, like we saw over a year ago, to real production use cases right now.

- Yeah, things are definitely moving fast. We are at a point for local AI now where several things are converging. First, of course, hardware has gotten more powerful with NPUs and GPUs available. Second, we now have smarter and more efficient AI models, which need less power and memory to run well. Also, better quantization and distillation mean that even big models can fit and work well directly on your device.
This chart, for example, compares the GPT-3.5 frontier model, which was one of the leading models around two years ago. And if I compare the accuracy of its output with a smaller quantized model like gpt-oss, you'll see that bigger isn't always better. The gpt-oss model exceeds the larger GPT-3.5 LLM on accuracy. And third, as I'll show you, using the new Foundry Local SDK, the developer experience for building local AI is now a lot simpler. It removes a ton of complexity for getting your apps right into production. And because the AI is local, you don't even need an Azure subscription.

- Okay, so what scenarios do you see this unlocking?

- Well, there are a lot of scenarios where local AI can be quite powerful, actually. For example, if you are offline on a plane, or are working in a disconnected or poor-connectivity location where latency is an issue, these models will still run. There's no reliance on the internet. Next, if you have specific privacy requirements for your data, data used for AI reasoning can be stored locally or within your corporate network versus the cloud. And because inference using Foundry Local is free, the costs are more predictable.

- So lower latency, data privacy, cost predictability. Now, you also mentioned a simpler developer experience with a new Foundry Local SDK. So how does Foundry Local change things?

- Well, the biggest issue that we are addressing is app portability. For example, as a developer today, if you wanted to build an AI app that runs locally on most device hardware and across different OS platforms, you'd have to write the device selection logic yourself and debug cross-platform issues. Once you'd done that, you would need to package it for the different execution providers by hardware type and different device platforms just so that your app could run on those platforms and across different device configurations. It's an error-prone process. Foundry Local, on the other hand, makes it simple. We have worked extensively with our silicon partners like NVIDIA, Intel, Qualcomm, and AMD to make sure that Foundry Local models just work right on the hardware that you have.

- Which is great, because as a developer, you can just focus on building your app. The same app is going to target and work on any consuming device then, right?

- That's right. In fact, I'll show you. I have built this healthcare concierge app that's an offline assistant for addressing healthcare questions using information private to me, which is useful when I'm traveling. It's using a number of models, including the quantized 1.5-billion-parameter Qwen model, and it has options to choose other models. This includes the Whisper model for spoken input using speech-to-text conversion, and it can pull from multiple private local data sources using semantic search to retrieve the information it needs to generate responses. I'm going to run the app on different devices with diverse hardware. I'll start with Windows, and after that I'll show you how it works on other operating systems. Our first device has a super common configuration. It's a Windows laptop running a previous-generation Intel Core with an integrated GPU and no NPU. I have another device, which is a previous-generation AMD PC, also without an NPU. Next, I have a Qualcomm Snapdragon X Plus PC with an NPU. And my fourth device is an Intel PC with an NVIDIA RTX GPU. I'm going to use the same prompt on each of these devices using text first. I'll prompt: If I have 15 minutes, what exercises can I do from anywhere to stay healthy?
And as I run each of these, you'll see that the model is being inferenced across different chipsets. This is using the same app package to support all of these configurations. The model generates its response using its real-world training and reasoning over documents related to my medical history. By the way, I'm just using synthetic data for this demo. It's not my actual medical history. But the most important thing is that this is all happening locally. My private data stays private. Nothing is traversing to or from the internet.

- Right, and I can see this being really great for any app scenario that requires more stringent data compliance. You know, based on the configs that you ran across those four different machines that you remoted into, they were relatively new, though. Would it work on older hardware as well?

- Yeah, it will. The beauty of Foundry Local is that it makes AI accessible on almost any device. In fact, this time I'm remoted into an eighth-gen Intel PC. It has integrated graphics and eight gigs of RAM, as you can see here in Task Manager. I'll minimize this window and move over to the same app we just saw. I'll run the same prompt, and you'll see that it still runs even though this PC was built and purchased in 2019.

- And as we saw, that went a little bit slower than some of the other devices, but that's not really the point here. It means that you, as a developer, can use the same package and it'll work across multiple generations and types of silicon.

- Right, and you can run the same app on macOS as well. Right here, on my Mac, I'll run the same code. We have Foundry Local packaged for macOS here. I'll run the same prompt as before, and you'll see that just like it ran on my Windows devices, it runs on my Mac as well. The app experience is consistent everywhere. And the cool thing is that local AI is also multimodal. Because this app supports voice input, this time I'll speak out my prompt. First, to show how easy it is to change the underlying AI model, I'll swap it to Phi-4-mini-reasoning. Like before, it is set up to use locally stored information for grounding, and the model's real-world understanding to respond. This time I'll prompt it with: I'm about to go on a nine-hour flight and will be in London. Given my blood results, what food should I avoid, and how can I improve my health while traveling? And you'll see that it's converted my spoken words to text. This prompt requires a bit more reasoning to formulate a response. With the think steps, we can watch how it breaks down what it needs to do, its reasoning over the test results, and how the flight might affect things. And voila, we have the answer. This is the type of response that you might have expected running on larger models and compute in the cloud, but it's all running locally with sophistication and reasoning. And by the way, if you want to build an app like this, we have published the code in our repo at aka.ms/foundrylocalsamples.

- Okay, so what is Foundry Local doing then to make all of this possible?

- There's lots going on under the covers, actually. So let's unpack. First, Foundry Local lets you discover the latest quantized AI models directly from the Foundry service and bring them to your local device. Once cached, these models can run locally for your apps with zero internet connectivity. Second, when you run your apps, Foundry Local provides a unified runtime built on ONNX for portability.
It handles the translation and optimization of your app for performance, tailored to the hardware configuration it's running on, and it'll select the right execution provider, whether it's OpenVINO for Intel, the AMD EP, NVIDIA CUDA, or Qualcomm's QNN with NPU acceleration, and more. So there's no need to juggle multiple SDKs or frameworks. And third, as your apps interact with cached local models, Foundry Local manages model inference.

- Okay, so what would I or anyone watching need to do to get this running on their device?

- It's pretty easy. I'll show you the manual steps for PC or Mac for anyone to get the basics running. And as a developer, this can all be done programmatically with your application's installer. Here I have the terminal open. To install Foundry Local using PowerShell, I'll run winget install Microsoft.FoundryLocal. Of course, on a Mac, you would use brew commands. And once that's done, you can test it out quickly by getting a model and running something like foundry model run qwen2.5-0.5b, or whichever model you prefer. And this process dynamically checks if the model is already local, and if not, it'll download the right model variant automatically and load it into memory. The time it'll take to locally cache the model will depend on your network configuration. Once it's ready, I can stay in the terminal and run a prompt. So for a quick test, I'll ask: Give me three tips to help me manage anxiety. And you'll see that the local model is responding to my prompt, and it's running 100% locally on this PC.

- Okay, so now you have all the baseline components installed on your device. Now, how do you go about building an app like we saw before?

- The best way to start is in AI Toolkit in VS Code. And with our new SDK, this lets you run Foundry Local models, manage the local cache, and visualize results within VS Code. So let me show you here. I have my project open in Visual Studio Code with the AI Toolkit installed. This is using the OpenAI SDK, as you can see here. It is a C# app using Foundry Local to load and interact with local models on the user's device. In this case, we are using a Qwen model by default for our chat completion. And it uses OpenAI Whisper Tiny for speech-to-text to make voice prompting work. So that's the code. From there you can package it for Windows and Mac, and you can package it for Android too.

- It's really great to see Foundry Local in action. And I can really see it helping light up local AI across different devices and scenarios. So for all the developers who are watching right now, what's the best way to get started?

- I would say try it out. You don't need specialized hardware or a dev kit to get started. First, to just get a flavor for Foundry Local on Windows, use the steps I showed with winget, and on macOS, use Homebrew. Then, and this is where you unlock the most, integrate it into your local apps using the SDK. And you can check out aka.ms/foundrylocalSDK.

- Thanks, Raji. It's really great to see how far things have come in this space. And thank you for joining us today. Be sure to subscribe to Mechanics if you haven't already. We'll see you again soon.
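To round out the install-and-build steps described above, here is a hedged sketch of the SDK flow: ask Foundry Local for a model by alias, then talk to it with the OpenAI client over the local endpoint. The member names follow the published Foundry Local samples but may differ from your installed package version, and the model alias and prompt are placeholders.

```csharp
// Hedged sketch of the Foundry Local SDK flow: start (and cache, if needed) a model
// for this hardware, then call it through the local OpenAI-compatible endpoint.
// FoundryLocalManager member names are taken from the published samples and may
// differ by package version; "qwen2.5-0.5b" is a placeholder alias.
using System;
using System.ClientModel;
using System.Threading.Tasks;
using Microsoft.AI.Foundry.Local;   // Foundry Local SDK
using OpenAI;
using OpenAI.Chat;

class FoundryLocalSketch
{
    static async Task Main()
    {
        var alias = "qwen2.5-0.5b";  // placeholder; any locally supported model alias works

        // Starts the local service if needed, downloads the best variant for this
        // hardware if it isn't cached, and loads it into memory.
        var manager = await FoundryLocalManager.StartModelAsync(aliasOrModelId: alias);
        var model = await manager.GetModelInfoAsync(aliasOrModelId: alias);

        var client = new OpenAIClient(
            new ApiKeyCredential(manager.ApiKey),
            new OpenAIClientOptions { Endpoint = manager.Endpoint });
        ChatClient chat = client.GetChatClient(model?.ModelId ?? alias);

        ChatCompletion completion = await chat.CompleteChatAsync(
            new UserChatMessage("Give me three tips to help me manage anxiety."));
        Console.WriteLine(completion.Content[0].Text);
    }
}
```

Everything in this sketch runs against the local endpoint, so no Azure subscription or internet connection is required once the model is cached.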