azure ai foundry
222 TopicsBuilding and Operating a Microsoft Foundry Hosted Agent with GitOps and GitHub Tasks
The Gap Between Prototype and Production Most AI engineering teams can build a working agent in a day. The hard part is not building it; the hard part is operating it. Prompts drift. Tool configurations change without review. Deployments happen from someone's laptop. There is no audit trail, no rollback plan, and no consistent way to promote a change from a development environment to production. GitOps closes that gap. By treating your agent definition, configuration, and infrastructure as version-controlled source code, you get the same delivery discipline that software engineering teams have applied to application code for years. Every change is reviewed, every deployment is automated, and every environment state is traceable to a specific commit. This post shows you how to apply GitOps principles to a Microsoft Foundry Hosted Agent using GitHub as the source of truth and GitHub Tasks and Actions as the automation layer. The result is a repeatable, governed, production-ready delivery model for AI agents. What Is a Microsoft Foundry Hosted Agent? Microsoft Foundry is Microsoft's platform for building, deploying, and operating AI applications and agents. A Hosted Agent is an agent runtime managed by the Foundry platform rather than self-hosted by your team. You supply the agent logic, configuration, and tools; Foundry handles the runtime lifecycle, scaling, and managed infrastructure. In practical terms, a Foundry Hosted Agent is a containerised agent application. You package your agent code, prompt definitions, tool bindings, and environment configuration into a container image. Foundry deploys and manages that container within a Foundry project, connected to models, tools, and observability infrastructure that the platform provides. Teams choose Hosted Agents over self-hosting because: The platform manages runtime infrastructure, patching, and scaling Integration with Azure AI models, managed identity, and observability is built in You can focus engineering effort on agent logic rather than cluster management Foundry projects provide environment and resource isolation without requiring you to provision and manage separate Azure resources for each environment Hosted Agents are a good fit when your team wants strong operational support with minimal platform overhead, when you need clear separation between environments, and when your agents depend on Azure AI capabilities such as Azure OpenAI Service, Azure AI Search, or Model Context Protocol integrations. Why GitOps Matters Specifically for AI Agents GitOps is straightforward for stateless web services: the code changes, the pipeline runs, the container is deployed. AI agents are more complex because there are multiple distinct artefacts that all affect agent behaviour: System prompts and instruction files Tool definitions and external integrations Model selection and configuration (temperature, max tokens, safety settings) Model Context Protocol (MCP) server definitions Orchestration logic and agent workflow code Safety and policy settings Infrastructure and deployment configuration Any one of these can change the behaviour of your agent in ways that are difficult to detect without structured review. A prompt change that looks harmless can alter tone, scope, or factual grounding. A tool configuration change can expose data to unintended callers. A model upgrade can shift response quality unpredictably. Git gives you a single place to version, review, and approve all of these artefacts together. Pull requests give you a structured review gate. Workflow automation gives you validation before anything reaches a deployed environment. Tags and releases give you deployment markers you can roll back to. The discipline of GitOps turns what is often an ad-hoc AI delivery process into a repeatable engineering practice. Reference Architecture The following diagram shows a practical reference architecture for delivering a Microsoft Foundry Hosted Agent through a GitOps model using GitHub. +---------------------------+ | GitHub Repository | | /src /agents /tools | | /prompts /infra | | /.github/workflows | +---------------------------+ | | Pull Request / Push to main v +---------------------------+ | GitHub Actions | | 1. Validate agent config | | 2. Lint and scan code | | 3. Run unit tests | | 4. Build container image | | 5. Push to registry | +---------------------------+ | | Image tag (SHA or semver) v +---------------------------+ | Azure Container Registry | | myregistry.azurecr.io | | my-agent:<sha> | +---------------------------+ | +------+------+ | | v v +----------+ +----------+ | Foundry | | Foundry | | Dev | | Test | | Project | | Project | +----------+ +----------+ | Approval gate (GitHub env) | v +----------+ | Foundry | | Prod | | Project | +----------+ | v +---------------------------+ | Observability | | Azure Monitor / App | | Insights / Foundry Logs | +---------------------------+ Key design decisions in this architecture: The GitHub repository is the single source of truth for all agent artefacts No human deploys directly to any Foundry project; all changes flow through automation Environment promotion requires a GitHub environment approval, creating a governance gate The container image is built once and promoted across environments; the image is not rebuilt per environment Secrets are stored in Azure Key Vault and accessed by the Foundry agent at runtime via managed identity Figure: GitOps delivery pipeline stages from commit to production Repository Structure A well-structured repository separates agent logic from infrastructure and tooling from prompts. The following structure works well in practice: my-foundry-agent/ ├── .github/ │ ├── workflows/ │ │ ├── validate.yml # Runs on every PR │ │ ├── build-deploy.yml # Runs on merge to main │ │ └── rollback.yml # Manual trigger workflow │ └── CODEOWNERS # Review assignments by path ├── src/ │ ├── agents/ │ │ ├── agent.py # Agent entry point and orchestration │ │ └── agent_config.json # Agent metadata and settings │ ├── tools/ │ │ ├── search_tool.py # Tool implementations │ │ └── data_tool.py │ └── prompts/ │ ├── system.txt # System prompt (versioned as plain text) │ └── instructions.txt # Supplementary instructions ├── tests/ │ ├── unit/ # Unit tests for tools and logic │ ├── integration/ # Integration tests against a running agent │ └── smoke/ # Post-deployment smoke tests ├── infra/ │ ├── main.bicep # Foundry project and resource definitions │ └── environments/ │ ├── dev.parameters.json │ ├── test.parameters.json │ └── prod.parameters.json ├── scripts/ │ ├── validate_agent.py # Config validation script │ └── smoke_test.py # Smoke test runner ├── Dockerfile # Container image definition └── docs/ └── architecture.md # Architecture and runbook documentation What belongs where and why: /src/prompts - System prompts as plain text files. Versioning prompts as files means every change goes through a pull request with a diff review, just as code does. /src/agents - Agent orchestration logic and configuration. Keeps the entry point and agent metadata co-located. /src/tools - Tool implementations separated from agent logic. Tool logic changes independently and should be reviewable in isolation. /infra - Infrastructure as code with per-environment parameter files. Environment-specific values live here, never in source files. /tests - Three layers of testing: unit tests for tools, integration tests for the full agent, and smoke tests that run against a deployed environment. /.github/workflows - All automation defined as code. There should be no manual deployment steps that live outside this directory. GitHub Tasks Across the Delivery Lifecycle GitHub Tasks and Issues provide the work tracking layer on top of the GitOps delivery model. Used well, they connect the intention behind a change to its implementation and deployment history. Practical patterns for using GitHub Tasks with agent delivery: Prompt change task - Open an issue to describe why the system prompt is changing. The pull request that changes system.txt closes that issue, creating a permanent link between the rationale and the diff. Tool integration task - When adding a new MCP server or external tool integration, create a task that captures the design decision, security review outcome, and test evidence before the pull request is merged. Model upgrade task - When upgrading the underlying model version, create a task that includes evaluation results and comparison data. The task becomes part of your change audit trail. Rollback task - If a deployment causes quality regressions, create a task to track the rollback, root cause investigation, and corrective action. Automation can open this task automatically when a deployment fails health checks. Dependency on approval - GitHub Tasks can be linked to environment approvals in GitHub Actions. A task in a specific milestone or project column can gate a promotion workflow. The key insight is that GitHub Tasks are not just work management; they are part of your audit trail. A regulatory or security reviewer can follow the chain from a production deployment back through workflow runs, pull request reviews, and the original task that described the intent of the change. End-to-End GitOps Flow The following walk-through describes a realistic developer experience for changing an agent prompt and promoting it to production. A developer opens a GitHub Issue describing the prompt change required and the expected behaviour improvement. The developer creates a feature branch, edits src/prompts/system.txt , and updates any related unit tests. A pull request is opened. The validate workflow runs immediately, checking prompt length, configuration schema, and lint rules. Unit tests run against the changed files. A code reviewer approves the pull request. The CODEOWNERS file ensures that prompt changes require review from the AI engineering team, not just any contributor. On merge to main, the build workflow runs: the container image is built with the new prompt baked in, tagged with the commit SHA, and pushed to Azure Container Registry. The deployment workflow deploys the new image to the Foundry Dev project automatically. Integration and smoke tests run against the deployed dev agent. If tests pass, the workflow pauses at the Test environment gate and requests approval from a named reviewer. After approval, the same image is deployed to Foundry Test. Smoke tests run again. A second approval gate controls promotion to Foundry Prod. If at any point a health check or smoke test fails, the rollback workflow redeploys the previous image tag from the registry. The image tag of the last known-good deployment is stored as a GitHub environment variable. This flow means that no human ever deploys directly to any environment. Every environment state is traceable to a specific commit, image tag, and workflow run. Security and Governance AI agents often have access to sensitive data and external systems. Security and governance cannot be an afterthought. Identity and Access Use managed identity for the Foundry Hosted Agent to access Azure resources. Avoid service principal secrets where Microsoft Entra Workload Identity or managed identity is available. Apply the principle of least privilege: the agent identity should have read access to data sources and limited write access only where the use case requires it. Tool integrations that require API keys or external credentials should retrieve them from Azure Key Vault at runtime, never from environment variables baked into the image. Secrets and Configuration Store secrets in Azure Key Vault. Reference them in your Foundry project configuration using Key Vault references. Store GitHub Actions secrets using repository or environment-scoped secrets. Never echo secrets in workflow logs. Separate environment configuration (endpoints, resource names, capacity settings) from agent logic. Use the /infra/environments/ parameter files for this. Auditability and Review Enforce pull request reviews for all changes to /src/prompts , /src/agents , and /infra via CODEOWNERS. Require status checks to pass before merging. Blocked merges prevent untested changes reaching production. GitHub's workflow run history gives you a complete deployment audit trail. You can answer "what was deployed to prod on Tuesday and who approved it" in seconds. For regulated environments, consider branch protection rules that require signed commits. Safe Rollout Use canary or blue-green patterns where Foundry supports them for high-traffic agents. Always keep the previous image tag available in the registry. Do not delete images on deployment. Document and test your rollback procedure before you need it in production. Observability and Operational Readiness A deployed agent that you cannot observe is an agent you cannot operate. Build observability in from the start. What to Monitor Deployment health - Track whether each Foundry deployment succeeded and the agent is responding. Wire deployment outcomes back to GitHub workflow run status. Model and tool errors - Log tool call failures, model timeout errors, and safety filter activations. Aggregate these in Azure Monitor or Application Insights. Latency - Track end-to-end response latency per agent version. A latency increase after a model or prompt change is an early signal of a quality regression. Token consumption - Monitor token usage per request and per session. Unexpected increases can indicate prompt injection or runaway orchestration loops. Traceability - Log which agent version handled each request. Correlation between the image tag and request traces is essential for debugging production issues. Debugging and Alerting Use structured logging with a consistent schema. Include fields for agent version, session ID, tool called, and outcome. Set up alerts for error rate thresholds and latency percentiles. Alert before users notice the problem. For failed agent runs, ensure logs capture the full conversation context (within your data retention policy) so that developers can reproduce and diagnose the failure. Microsoft Foundry Toolboxes One of the most important additions to the Foundry platform is Toolboxes, currently in Public Preview. If you have ever seen an agent codebase where three different agents each wire the same search tool with their own credentials and slightly different configurations, you already understand the problem Toolboxes solve. A Toolbox is a named, versioned bundle of tools managed centrally in Microsoft Foundry. You define the tools once, configure authentication and access centrally, and publish a single MCP-compatible endpoint. Any agent in any runtime consumes that endpoint without per-tool wiring, custom SDK integration, or duplicated credential management. Figure: Before and after Foundry Toolboxes. Each agent previously managed its own tool connections. With Toolboxes, agents connect to one governed endpoint. The Four Pillars Discover (coming soon) - Find approved tools without browsing long catalogues. Reduces duplication by surfacing what already exists before developers build something new. Build (available today) - Select tools into a named toolbox. Supported types include built-in tools (Web Search, Code Interpreter, File Search, Azure AI Search), MCP servers, Agent-to-Agent (A2A) endpoints, and OpenAPI-defined services. Consume (available today) - A single MCP-compatible endpoint exposes every tool in the toolbox to any agent runtime. Agents that can speak MCP can use a Foundry Toolbox without any Foundry-specific SDK dependency. Govern (coming soon) - Centralised authentication and observability applied to every tool call flowing through the toolbox. Security and platform teams get consistent controls without asking developers to bolt governance onto every agent individually. Toolboxes and GitOps: A Natural Fit Toolboxes are particularly well-suited to a GitOps delivery model because the toolbox definition is a discrete, versioned artefact. Instead of credentials and tool configuration scattered across agent codebases, the toolbox becomes its own managed entity with its own version history. The key design property is that the toolbox endpoint URL is stable. When you promote a new toolbox version to be the default, agents consuming the endpoint pick up the update without any code changes. This means you can update tool configuration, add a new MCP server, or rotate credentials in the toolbox without redeploying every agent that uses it. Figure: Toolbox versioning in a GitOps model. Commits trigger CI validation and deployment of new toolbox versions. The stable endpoint URL allows agents to consume updates without redeployment. Adding a Toolbox to Your Repository In your GitOps repository, toolbox definitions belong in /src/tools/toolbox_config.py or as a declarative configuration file checked into version control. The following example creates a toolbox that combines web search, Azure AI Search over internal documentation, and a GitHub MCP server: # src/tools/toolbox_config.py # Run this via CI to create or update a toolbox version in Foundry. from azure.identity import DefaultAzureCredential from azure.ai.projects import AIProjectClient import os client = AIProjectClient( endpoint=os.environ["FOUNDRY_PROJECT_ENDPOINT"], credential=DefaultAzureCredential() ) toolbox_version = client.beta.toolboxes.create_toolbox_version( toolbox_name="customer-feedback-toolbox", description="Tools for triaging customer feedback: search, docs, and GitHub.", tools=[ { "type": "web_search", "description": "Search approved public documentation sites.", "custom_search_configuration": { "project_connection_id": os.environ["BING_CONNECTION_NAME"], "instance_name": os.environ["BING_INSTANCE_NAME"] } }, { "type": "azure_ai_search", "name": "product-manuals-search", "description": "Search internal product documentation.", "azure_ai_search": { "indexes": [ { "index_name": os.environ["SEARCH_INDEX_NAME"], "project_connection_id": os.environ["SEARCH_CONNECTION_ID"] } ] } }, { "type": "mcp", "server_label": "github", "server_url": "https://api.githubcopilot.com/mcp", "project_connection_id": os.environ["GITHUB_CONNECTION_ID"] } ], ) print(f"Toolbox version created: {toolbox_version.version}") print(f"MCP endpoint: {toolbox_version.mcp_endpoint}") To promote a toolbox version to be the default (the endpoint agents use without specifying a version), add this to your deployment workflow: # Promote toolbox version to default after validation toolbox = client.beta.toolboxes.update( toolbox_name="customer-feedback-toolbox", default_version=toolbox_version.version, ) print(f"Default version is now: {toolbox.default_version}") The stable endpoint for agents consuming this toolbox is: https://<your-project>.services.ai.azure.com/api/projects/<project>/toolbox/customer-feedback-toolbox/mcp?api-version=v1 Attaching the Toolbox to Your Hosted Agent In your agent code, connect to the toolbox via a single MCP tool definition. The agent gains access to every tool in the toolbox without knowing their individual configurations: # src/agents/agent.py (relevant excerpt) from agent_framework import MCPStreamableHTTPTool import httpx, os toolbox_endpoint = os.environ["FOUNDRY_TOOLBOX_ENDPOINT"] http_client = httpx.AsyncClient( auth=_ToolboxAuth(token_provider), # Microsoft Entra bearer token timeout=120.0, ) mcp_tool = MCPStreamableHTTPTool( name="toolbox", url=toolbox_endpoint, http_client=http_client, load_prompts=False, ) # Agent now has access to web search, AI Search, and GitHub MCP # through one tool definition and one authenticated connection. GitOps Workflow Extension for Toolboxes Add a dedicated job to your build-deploy workflow to create and promote toolbox versions as part of the same CI/CD pipeline: deploy-toolbox: name: Deploy Toolbox Version needs: validate runs-on: ubuntu-latest environment: dev permissions: id-token: write contents: read steps: - uses: actions/checkout@v4 - name: Azure login (OIDC) uses: azure/login@v3 with: client-id: ${{ secrets.AZURE_CLIENT_ID_DEV }} tenant-id: ${{ secrets.AZURE_TENANT_ID }} subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }} - name: Create toolbox version in Foundry env: FOUNDRY_PROJECT_ENDPOINT: ${{ vars.FOUNDRY_PROJECT_ENDPOINT_DEV }} BING_CONNECTION_NAME: ${{ vars.BING_CONNECTION_NAME }} BING_INSTANCE_NAME: ${{ vars.BING_INSTANCE_NAME }} SEARCH_INDEX_NAME: ${{ vars.SEARCH_INDEX_NAME }} SEARCH_CONNECTION_ID: ${{ vars.SEARCH_CONNECTION_ID }} GITHUB_CONNECTION_ID: ${{ vars.GITHUB_CONNECTION_ID }} run: python src/tools/toolbox_config.py Key points to note: Toolbox configuration is Python code in source control, reviewed through pull requests like any other change Connection IDs and index names are environment variables from GitHub Actions variables, not hardcoded in the script The same script runs for dev, test, and prod with different environment variable bindings Toolbox version promotion is a separate step from agent deployment, so you can update tools independently of the agent container Because the toolbox endpoint is stable, rolling back a toolbox version does not require rolling back the agent image Common Pitfalls Teams adopting this pattern commonly make the following mistakes. Identifying them early saves significant operational pain later. Treating prompts as unmanaged text. If your system prompt lives in a portal text box rather than a versioned file, you have no history, no review process, and no rollback capability. Move prompts into source control on day one. Deploying manually from the portal. Even one manual deployment breaks the GitOps contract. Your repository no longer reflects the true state of the environment. Automate everything and remove portal deployment permissions from individuals. Mixing environment configuration into source files. Hardcoded endpoint URLs or model deployment names in agent_config.json mean your dev and prod configurations diverge at the source level. Use parameter files and environment variables resolved at deployment time. Poor separation between agent logic and tool logic. When agents and tools are tightly coupled in a single file, a tool change requires a full agent review and redeployment. Keep them separate so they can evolve independently. Not versioning your Toolbox definition. Defining a Foundry Toolbox interactively through the portal gives you no audit trail and no rollback path. The toolbox configuration script belongs in source control alongside your agent code. Skipping evaluation before promotion. Deploying a prompt change without running a structured evaluation against a representative test set is how regressions reach production. Build evaluation into the pull request workflow, not just the deployment workflow. No rollback plan. If your first rollback is unplanned and urgent, it will be slow and stressful. Test your rollback procedure in a non-production environment and document the steps. Ignoring token and cost signals. AI workloads have variable cost profiles. A change that doubles average token consumption per request may be functionally correct but economically unsustainable. Monitor consumption as a first-class signal. Example GitHub Actions Workflow The following workflow runs on pull request validation and on merge to main. It covers the core delivery lifecycle: validate, build, deploy to dev, and smoke test. # .github/workflows/build-deploy.yml name: Build and Deploy Foundry Hosted Agent on: push: branches: - main pull_request: branches: - main env: REGISTRY: myregistry.azurecr.io IMAGE_NAME: my-foundry-agent jobs: validate: name: Validate Agent Configuration runs-on: ubuntu-latest steps: - uses: actions/checkout@v4 - name: Set up Python uses: actions/setup-python@v5 with: python-version: "3.12" - name: Install dependencies run: pip install -r requirements.txt - name: Validate agent config schema run: python scripts/validate_agent.py - name: Run unit tests run: pytest tests/unit/ -v - name: Lint code run: ruff check src/ build: name: Build and Push Container Image needs: validate runs-on: ubuntu-latest if: github.ref == 'refs/heads/main' permissions: id-token: write contents: read outputs: image_tag: ${{ steps.meta.outputs.version }} steps: - uses: actions/checkout@v4 - name: Azure login (OIDC) uses: azure/login@v3 with: client-id: ${{ secrets.AZURE_CLIENT_ID }} tenant-id: ${{ secrets.AZURE_TENANT_ID }} subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }} - name: Log in to Azure Container Registry run: az acr login --name ${{ env.REGISTRY }} - name: Extract metadata id: meta uses: docker/metadata-action@v5 with: images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }} tags: | type=sha,format=short - name: Build and push image uses: docker/build-push-action@v7 with: context: . push: true tags: ${{ steps.meta.outputs.tags }} deploy-dev: name: Deploy to Foundry Dev needs: build runs-on: ubuntu-latest environment: dev permissions: id-token: write contents: read steps: - uses: actions/checkout@v4 - name: Azure login (OIDC) uses: azure/login@v3 with: client-id: ${{ secrets.AZURE_CLIENT_ID_DEV }} tenant-id: ${{ secrets.AZURE_TENANT_ID }} subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }} - name: Deploy agent to Foundry Dev project run: | az ai foundry agent deploy \ --project ${{ vars.FOUNDRY_PROJECT_DEV }} \ --image ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ needs.build.outputs.image_tag }} \ --environment dev - name: Run smoke tests against dev run: pytest tests/smoke/ -v --base-url ${{ vars.AGENT_URL_DEV }} deploy-test: name: Deploy to Foundry Test needs: deploy-dev runs-on: ubuntu-latest environment: test permissions: id-token: write contents: read steps: - uses: actions/checkout@v4 - name: Azure login (OIDC) uses: azure/login@v3 with: client-id: ${{ secrets.AZURE_CLIENT_ID_TEST }} tenant-id: ${{ secrets.AZURE_TENANT_ID }} subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }} - name: Deploy agent to Foundry Test project run: | az ai foundry agent deploy \ --project ${{ vars.FOUNDRY_PROJECT_TEST }} \ --image ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ needs.build.outputs.image_tag }} \ --environment test - name: Run smoke tests against test run: pytest tests/smoke/ -v --base-url ${{ vars.AGENT_URL_TEST }} Key decisions in this workflow: Validation runs on every pull request, not just on merge. Fast feedback catches problems before review. The container image is built once and the image tag is passed forward to deployment jobs. The same artefact is promoted across environments. Authentication uses OIDC federated credentials via azure/login@v3 with id-token: write permissions. No long-lived secrets are stored in GitHub for Azure authentication. The environment: test directive in the deploy-test job triggers a GitHub environment approval gate. A named reviewer must approve before the job runs. Smoke tests run after every deployment. A failed smoke test prevents further promotion. Best Practices Checklist Use this checklist when adopting the GitOps pattern for a Microsoft Foundry Hosted Agent: All agent artefacts, including prompts, tool definitions, model configuration, and Toolbox configuration scripts, are committed to source control No manual deployments to any environment; all changes flow through GitHub Actions workflows Pull request reviews are enforced for all changes to agent logic, prompts, and infrastructure via CODEOWNERS Unit tests cover tool logic; integration tests cover end-to-end agent behaviour; smoke tests cover deployed environments Container images are built once per commit and promoted across environments; images are not rebuilt per environment Environment configuration (endpoints, resource names) lives in parameter files, never in source code Secrets are stored in Azure Key Vault and accessed via managed identity at runtime GitHub environment approval gates control promotion from dev to test to prod Foundry Toolboxes are used to centralise tool definitions, credentials, and access governance across all agents; the toolbox configuration script is version-controlled and deployed through CI/CD Toolbox versions are promoted via the update default_version API step in the deployment workflow, not manually through the portal Latency, error rate, and token consumption are monitored with alerting thresholds The rollback procedure is documented, automated, and has been tested in a non-production environment GitHub Issues are used to record the intent behind significant changes and link to the pull requests that implement them Branch protection rules prevent direct pushes to main and require status checks to pass before merge The previous image tag is retained in the registry and stored as a GitHub environment variable for rollback Conclusion A Microsoft Foundry Hosted Agent is not something you deploy once and forget. Prompts evolve, tools change, models are upgraded, and policy requirements shift. Every one of those changes has the potential to alter agent behaviour in ways that affect users, costs, and compliance posture. GitOps, implemented through GitHub and GitHub Tasks, gives you the operational discipline to manage that complexity. Source control for all artefacts. Pull request review for every change. Automated validation, build, and deployment. Environment promotion gates. A complete audit trail from task to production. These are not bureaucratic overhead; they are the foundation of reliable, trustworthy AI agent operations. The teams that operate AI agents well are the ones that treat them like production software from the start. The investment in pipeline, structure, and governance pays back every time a change goes smoothly, every time a rollback takes minutes rather than hours, and every time a security or compliance reviewer can answer their question from a pull request history rather than a support ticket. Build the discipline in early. Your future self, and your production environment, will benefit from it. References Microsoft Foundry documentation Microsoft Foundry Agent Service documentation Microsoft Foundry Toolboxes documentation Introducing Toolboxes in Foundry (Microsoft Developer Blog) GitHub Actions documentation GitHub Projects and Tasks documentation Azure Container Registry documentation Azure Key Vault documentation Microsoft Entra Managed Identities documentation OpenGitOps PrinciplesAutomate evaluations | Microsoft Foundry
Trace every run end-to-end, generate synthetic datasets to stress-test on demand, fire automated Red Team attacks at your own agents, and pin down why evaluations fail — all from the Microsoft Foundry control plane. Lock in guardrails that inspect every tool call at runtime, define the risks once, and enforce them across every agent run. Mohammad Abuomar, Responsible AI Principal Architect, shares how to turn a coding agent into production-ready software inside Foundry. Describe the agent, set the row count, confirm. Your test set lands in seconds. Microsoft Foundry’s synthetic dataset generator builds eval data on demand. Get started. Pin down why your agent fails evaluations. Foundry’s Analyze Results uses AI to cluster failures, name the root cause, and recommend specific fixes. Check it out. Lock down agent behavior with the Task Adherence Guardrail. It inspects every tool call against the original task and blocks the off-script ones. Try it in Microsoft Foundry. QUICK LINKS: 00:00 — Microsoft Foundry control plane 00:33 — See a finished agent 02:30 — See where the agent started 03:19 — Traces 04:04 — Built-in monitoring 04:34 — Evaluation types 05:51 — Red team evaluations 07:08 — Evaluation results 08:14 — Built-in Guardrails 08:14 — Wrap up Link References Get everything you need in Microsoft Foundry at https://ai.azure.com Unfamiliar with Microsoft Mechanics? As Microsoft’s official video series for IT, you can watch and share valuable content and demos of current and upcoming tech from the people who build it at Microsoft. Subscribe to our YouTube: https://www.youtube.com/c/MicrosoftMechanicsSeries Talk with other IT Pros, join us on the Microsoft Tech Community: https://techcommunity.microsoft.com/t5/microsoft-mechanics-blog/bg-p/MicrosoftMechanicsBlog Watch or listen from anywhere, subscribe to our podcast: https://microsoftmechanics.libsyn.com/podcast Keep getting this insider knowledge, join us on social: Follow us on Twitter: https://twitter.com/MSFTMechanics Share knowledge on LinkedIn: https://www.linkedin.com/company/microsoft-mechanics/ Enjoy us on Instagram: https://www.instagram.com/msftmechanics/ Loosen up with us on TikTok: https://www.tiktok.com/@msftmechanics Video Transcript: -If you want to build agents that meet your expectations for output quality, performance, safety, and cost, it’s not just about the model or framework you select. The testing, evaluation, and the controls surrounding your agent matter. And that’s what the Microsoft Foundry control plane is designed to do, with tools you can use during development to make sure your agents raise the bar across every important dimension. Today, I’m going to walk through the process of building a coding agent and demonstrate where the controls in Foundry come in to make it better. I’ll start by showing my finished agent, then after that, I’ll show you the steps I took using Foundry to make it production ready. This agent is designed to take a simple user prompt, then find what it needs to build apps automatically. -So, I’ll paste in my prompt asking it to generate a Windows desktop app for personal cashflow management. It needs to be fast, use WebUI, and easy to use for broad appeal. I’m also asking it to make it safe, secure, and follow privacy best practices. And it needs to be easy for a developer to read, maintain, and to add to it. I’ll submit the request and it gets to work, with its reasoning on the left and code on the right. This process takes several minutes to complete with a few interactions in between, so to save a little time, I’ll skip to the result. We can see the agent’s reasoning and plan, with its technology stack, approach and initial action. Then, we can follow all of the steps it performed to author and configure the app and its dependencies. -Below that, is the React code and JavaScript. It asked whether to proceed writing this as an Electron and React setup, and I confirmed. Then it started to write, test and iterate on the app, followed by another question whether to implement more features or focus on security. And I responded to do both. It then finished writing the app and finally it outlined the steps to run the app locally. -So, let’s test it out. I’ll move over to my terminal window running PowerShell and start it. And here is the generated app. It’s fully functional with user authentication. I can enter my first item, Travel Expenses, and the amount, and there’s a Category dropdown menu with pre-configured options, and I’ll choose “Transportation”. And it writes that record into the local data store. This is a simple, production-ready app that the agent was able to create in just a few minutes. But it didn’t start out this way, and if you’ve built agents or apps yourself, you’ll know a lot of what doesn’t get shown is the testing, iteration, and refinement work to end up with production-ready code. Let’s change that. -Let’s go back in time to where this agent started. I’m in Visual Studio Code and this is my agent, which I built using the Foundry SDK. Here are the defined tools for it to use, WebSearch and CodeInterpreter. And on the left, we can see the full list of local tools. Like interacting with the file system, as well as git, patching, registry, local search and running shell scripts. And here in the center is the key SDK line that creates the agent, adds the tools, deployment name and so on. -So, the agent is functional and I’ve also started manual testing. And this is where Foundry controls let me stress‑test the agent to see what works and what doesn’t and see the details for each run. In the Microsoft Foundry portal, I have my agent open and the Traces tab. These are OTel traces of all of the runs for this agent, with the newest runs on top, everything here is backed by Azure Monitor. And I can click into any conversation or Trace ID to view the Input + Output turns for that session. They’re easier to parse than standard logs, speeding up reviews. We can also see the system message, user input, and what the agent did. Along with the agent’s reasoning, the technology stack it used, and the app features. Below that, we can see the development process as well as tool outputs Beyond that, with built-in monitoring, you can get a roll-up view of all activities for our agent with key metrics I’m in the Monitor tab. It shows me the estimated cost and token usage so far. This agent is new so I haven’t configured Evaluations yet, but we’ll get to those in a moment. -Next, you’ll see Operational metrics like the number of agent runs and how many successfully completed or failed, token consumption, tool calls made by the agent, and the error rate over time. Evaluations are where a lot more testing automation comes in to help you improve agent faster. I’m in the Evaluations tab, I need to create my first one. The options are: Automatic Evaluation, where you can automate the process using AI; Human Evaluation, where someone tests the agent and completes surveys; and Red team, where an agent runs automated attacks to expose vulnerabilities. I’ll start with Automatic Evaluation and hit Create. It starts with defining a target. My agent and the version I want are already selected. For data, I can upload an existing dataset or save time by creating a synthetic dataset, which is very cool. This generates data automatically, you just select the number of rows you want. I’ll guide it with a prompt, “Create a dataset for evaluating a coding agent.” I’ll skip the reference file and just Confirm. That automatically generates 90 rows of data to test with. -Next, I’ll choose the evaluation Criteria. There are several built-in evaluators for Agents. Below that are evaluators for Quality. These are editable, so I’ll remove Coherence, Fluency, and Groundedness because my agent doesn’t need them. For Safety, there are seven evaluators, and I’ll keep them as-is and move on to Review, then Submit it. These Automatic Evaluations can take several minutes to complete, so while it’s working, I’ll move into Red Teaming, which is now becoming a core part of AI testing to spot vulnerabilities early on. I’ve started creating my first red team evaluation. Let’s look at the standard configuration for risk categories. You can modify these. It can check for unsafe categories plus ungrounded attributes, code vulnerabilities, and task adherence. It shows the tools that the agent can access. I’ll provide descriptions for web_search, to search the internet for relevant SDKs, and the code_interpreter to run code for the coding agent. Then I’ll Save it. -Next, I’ll change Seed queries from 5 to 10 per category for more testing. In the Attack strategies, I can see exactly what the red teaming agents will try to do and select the ones most relevant to my agent. Each tile describes the attack type that will be tested. I’ll choose AsciiSmuggler, Base64, Jailbreak, StringJoin, UnicodeSubstitution, and IndirectJailbreak. Now, I can review the prohibited actions, including things like attempts to change password, and more. These are all things attackers might try to do with your agent, and we’re automating those tests for you. I’ll hit Submit to get everything started. Now, with two evaluations running, to save a little time, I’ll fast forward to the results of the evaluations. -Here, we can see the two runs. I’ll open the Automatic Evaluation first. Then clicking into the Run shows the details for each evaluator. If I scroll to the right, you’ll see that we’re green almost across the board. One glaring exception is the TaskCompletion score at 59%, which is below my bar, so it’s something to fix. One of my favorite capabilities in evaluation is using AI to analyze the results. I’ll start the analysis, and it creates a nice cluster analysis showing the main issues. I mentioned TaskCompletion before. Here, you can see “incomplete resolution” and “action plan issues”. Drilling in, looks that there is a “lack of actionable output” and the AI suggests specific ways to fix it. This saved me time to find ways to improve my agent. -Now, let’s review our Red Teaming evaluation. I’m at the top level view and I’ll click in to see the issues. Immediately, I can see that the Task adherence is red, which is also related to TaskCompletion. We can fix this using a built-in guardrail to check for task adherence. Guardrails define what risks to detect, from which point in the process, and how to respond. Let’s go to the agent playground. Scrolling down to Guardrails, I can see only the default model guardrail is set. Let’s add another by clicking Manage guardrails and Create. Here, I can define the risks and controls I want to enforce. I’ll start with Risk, and these are the types of risks we can detect and mitigate. There’s an option for “Task adherence” that I’ll choose. This guardrail checks any tool call made by the agent to ensure it’s used appropriately to “adhere” to the task. -Now, I just need hit Submit to activate this guardrail. And the TaskCompletion issue should now be fixed. In fact, here I’ve run another evaluation, and we can see that TaskCompletion is now green and everything meets our overall quality goals. With that, my agent is ready for broader use. And while I focused today on a single agent and using Foundry controls to test it, expose vulnerabilities, and make it better, Foundry also provides fleet-wide performance visibility across all agents and enables centrally applied and enforced policies and configurations to keep agents compliant. -To find out more and get started with these and other controls, you’ll find everything you need in Microsoft Foundry at ai.azure.com. Subscribe to Mechanics for the latest tech updates, and thanks for watching.205Views0likes0CommentsBuilding Agentic Systems on Azure: Microsoft Foundry Agents SDK vs Microsoft Agent Framework
In my recent experience as a Senior Consultant at Microsoft, I’ve been actively involved in designing and delivering AI-driven solutions, with a strong focus on building intelligent agents using modern frameworks. Along the way, I've built agents using both Microsoft Foundry Agents SDK (hereafter "Agents SDK") and Microsoft Agent Framework (MAF) Both approaches are powerful and capable. However, once you move beyond simple proofs of concept, the developer experience and architectural patterns start to differ significantly. This article provides a practical comparison based on real implementation experience and aims to help developers choose the right approach. Approach 1: Agents SDK Agents SDK provides a straightforward way to create agents with integrated tools and models. Example: Creating an Agent from azure.ai.projects import AIProjectClient from azure.ai.agents.models import AzureAISearchTool, AzureAISearchQueryType from azure.identity import DefaultAzureCredential client = AIProjectClient(credential=DefaultAzureCredential(), endpoint=os.getenv("AZURE_AI_PROJECT_ENDPOINT")) # Configure tools ai_search = AzureAISearchTool( index_connection_id=conn_id, index_name="my-index", query_type=AzureAISearchQueryType.SEMANTIC, ) # Create agent (persisted in Foundry portal) agent = client.agents.create_agent( model=os.getenv("AZURE_AI_AGENT_DEPLOYMENT_NAME"), name="MyAgent", instructions="You are a helpful assistant.", tool_resources=ai_search.resources, tools=ai_search.definitions, ) # Run conversation thread = client.agents.threads.create() client.agents.messages.create(thread_id=thread.id, role="user", content="Hello") run = client.agents.runs.create(thread_id=thread.id, agent_id=agent.id) What this approach provides Native integration with Azure AI services (OpenAI, AI Search, MCP) Managed execution environment Simple and quick agent setup Conceptually, this approach can be summarized as: Model + Tools + Execution Strengths ✅ Rapid development and onboarding ✅ Strong integration within the Azure ecosystem ✅ Well-suited for single-agent or tool-driven use cases ✅ Minimal infrastructure overhead Challenges observed in practice As the complexity of scenarios increases, certain limitations become more visible: Multi-agent workflows require custom orchestration logic Agent handoffs must be implemented manually Context sharing across agents requires additional design effort While this approach offers flexibility, it shifts orchestration complexity to the developer. Approach 2: Microsoft Agent Framework (MAF) Microsoft Agent Framework introduces a higher-level abstraction, focused on agent orchestration and system design. Creating an Agent from agent_framework import Agent, WorkflowBuilder, Message from agent_framework.foundry import FoundryChatClient from azure.identity import DefaultAzureCredential client = FoundryChatClient( project_endpoint=os.getenv("FOUNDRY_PROJECT_ENDPOINT"), model=os.getenv("FOUNDRY_MODEL_DEPLOYMENT_NAME"), credential=DefaultAzureCredential(), ) # Create agents (in-process only, not persisted in portal) researcher = Agent(client, name="ResearcherAgent", instructions="Research topics thoroughly.") writer = Agent(client, name="WriterAgent", instructions="Write concise summaries.") # Build and run multi-agent workflow workflow = WorkflowBuilder(start_executor=researcher).add_edge(researcher, writer).build() async for event in workflow.run(Message("user", "Summarize migration best practices"), stream=True): print(event.content) What this approach provides Built-in orchestration capabilities Native support for multi-agent workflows Structured agent lifecycle management Context and memory handling Conceptually, this can be viewed as: Agents + Orchestration + System Design Observations from implementation When implementing similar use cases using MAF: Agent responsibilities became clearly defined Routing and delegation patterns were significantly simplified Overall system architecture became easier to maintain and scale This approach encourages thinking in terms of agent ecosystems rather than isolated agents. Architecture Comparison Agents SDK Microsoft Agent Framework (MAF) Choosing the Right Approach Use Agents SDK when: You need rapid development for a single-agent use case The workflow is relatively straightforward You prefer flexibility and lower-level control Use Microsoft Agent Framework when: You are designing multi-agent systems Your solution requires routing, delegation, or handoffs Long-term scalability and maintainability are essential Pros and Cons Summary Agents SDK Pros Easy to get started Strong Azure integration Flexible design Cons Manual orchestration required Limited native multi-agent support Complexity increases as scenarios grow Microsoft Agent Framework (MAF) Pros Built-in orchestration Native multi-agent support Scalable and structured architecture Cons Learning curve for new developers More opinionated framework design Reduced low-level control compared to SDK-based approach References and Repositories 🔗 Microsoft Agent Framework (MAF) Microsoft Agent Framework – GitHub Repository Microsoft Agent Framework Samples – Tutorials & Examples Workflow Samples (Multi-agent patterns) FoundryChatClient sample (Python) Agent Framework demos - GitHub Source 📘 Documentation Microsoft Agent Framework Overview (Microsoft Learn) Agent Framework + Microsoft Foundry provider docs 🔗 Azure AI Projects / Agents SDK Azure AI Projects SDK – Python (GitHub Source) Azure AI Projects Agents (.NET SDK repo) 📘 Documentation Azure AI Projects SDK (Python) – Microsoft Learn Azure AI Agents SDK – Microsoft Learn Conclusion Azure AI Projects and Microsoft Agent Framework both play important roles in the modern agent development landscape. Agents SDK enables quick and flexible agent development Microsoft Agent Framework enables structured, scalable agent systems In practice, the choice depends on whether you are building a single agent feature or a multi-agent system. Final Thought Agents SDK helps you get started quickly. Microsoft Agent Framework helps you scale with confidence In a follow-up blog, I’ll dive into how the M365 Agents SDK compares with Microsoft Agent Framework, especially in the context of enterprise productivity and Copilot experiences.Building an On-Device Voice Assistant with Microsoft Foundry Local
Why on-device voice still matters Most "voice AI" tutorials assume your audio leaves the machine. You ship a WAV to Whisper-API, your transcript to GPT-4, and a synthesized response back over the wire. That works — but it also means three round trips, three per-token bills, and three places your user's voice gets logged. The new wave of small, hardware-optimised models changes the trade-off. NVIDIA's Nemotron Speech Streaming En 0.6B is a 600M-parameter streaming ASR model published into the Microsoft Foundry Local catalog. Paired with a small chat model like qwen2.5-0.5b or phi-4-mini , you can run the entire capture → transcribe → reason → respond loop in-process on a developer laptop, with no API keys and no network egress. This post walks through how the fl-nemotron sample does it, the SDK pitfalls we hit on the way, and the design decisions that made the pipeline reliable. What we're building A browser-hosted assistant served by FastAPI at http://127.0.0.1:8000 . The page captures microphone audio, posts it to /api/transcribe , then streams the chat reply back over Server-Sent Events from /api/chat . All inference runs locally through two Foundry Local models loaded into the same process. The shape of the pipeline: Microphone (browser MediaRecorder) │ WebM/Opus blob ▼ Client-side WAV encoder (16 kHz, mono, PCM-16) │ multipart/form-data ▼ FastAPI /api/transcribe │ ▼ Nemotron Speech Streaming En 0.6B (Foundry Local audio client) │ transcript text ▼ Chat LLM e.g. qwen2.5-0.5b (Foundry Local chat client) │ streamed tokens ▼ FastAPI /api/chat → SSE → browser bubble The version that bit us: foundry-local-sdk >= 1.1.0 Before any code, the single most important fact about this project: The Nemotron Speech Streaming model only appears in the Foundry Local 1.1.x catalog. Older SDKs (0.5.x / 0.6.x) cannot resolve the alias nemotron-speech-streaming-en-0.6b and fail with model not found . The module name also changed in 1.1.0 — it is now foundry_local_sdk (with the underscore- sdk suffix), not foundry_local . The pip wheel for foundry-local-core is bundled, so there is no separate MSI / winget install to worry about. Pin it explicitly: pip install --upgrade "foundry-local-sdk>=1.1.0,<2" And verify before anything else: python -c "import importlib.metadata as m; print('sdk', m.version('foundry-local-sdk'))" # expect: sdk 1.1.0 Loading both models from one manager The 1.1.x SDK exposes a single FoundryLocalManager that owns the runtime. Each loaded model gives you back a per-model OpenAI-compatible client — get_chat_client() for text models and get_audio_client() for ASR. There is no need to bring your own openai Python package; the SDK ships its own thin client. The wrapper used in the repo ( src/foundry_client.py ) does this: from foundry_local_sdk import Configuration, FoundryLocalManager FoundryLocalManager.initialize(Configuration(app_name="fl-nemotron")) manager = FoundryLocalManager.instance chat_model = manager.load_model("qwen2.5-0.5b") stt_model = manager.load_model("nemotron-speech-streaming-en-0.6b") chat_client = chat_model.get_chat_client() audio_client = stt_model.get_audio_client() Both models are downloaded on first use into the Foundry Local cache and stay resident for the lifetime of the process. On a laptop with 16 GB RAM, the combined working set sits comfortably under 4 GB. The transcription surprise The first naive approach was the obvious one: with open(wav_path, "rb") as f: result = audio_client.transcribe(file=f, model="nemotron-speech-streaming-en-0.6b") That call fails on Nemotron. The bundled ONNX Runtime GenAI in foundry-local-core does not register the nemotron_speech multi-modal model type that the standard AudioClient.transcribe() path tries to instantiate. The error surfaces as a cryptic model-type registration failure deep inside the native runtime. The fix is to use the streaming session API instead — a different native entry point ( core_interop.start_audio_stream ) that the streaming model does support. The repo isolates this in src/_nemotron_live.py : def transcribe_wav_live(audio_client, wav_path, *, language="en"): with wave.open(str(wav_path), "rb") as w: sample_rate = w.getframerate() channels = w.getnchannels() sample_width = w.getsampwidth() pcm = w.readframes(w.getnframes()) session = audio_client.create_live_transcription_session() session.settings.sample_rate = sample_rate session.settings.channels = channels session.settings.bits_per_sample = sample_width * 8 session.settings.language = language session.start() # Feed PCM in ~100 ms chunks from a worker thread, then stop. bytes_per_sec = sample_rate * channels * sample_width chunk_bytes = max(bytes_per_sec // 10, 1024) def _pusher(): try: for offset in range(0, len(pcm), chunk_bytes): session.append(pcm[offset:offset + chunk_bytes]) finally: session.stop() threading.Thread(target=_pusher, daemon=True).start() parts = [] for resp in session.get_stream(): for cp in getattr(resp, "content", []) or []: text = getattr(cp, "text", "") or getattr(cp, "transcript", "") or "" if text: parts.append(text) return " ".join(p.strip() for p in parts if p.strip()).strip() Two things to notice: Push from a thread, read from the main coroutine. session.append() is a blocking write into the native stream and session.get_stream() is a blocking generator. Run one in a worker thread so the other can drain in parallel — otherwise you deadlock the session. Chunk to ~100 ms. Smaller chunks (e.g. 10 ms) spend more time crossing the FFI boundary than transcribing; larger chunks (e.g. 1 s) hold back partial results and hurt perceived latency. Always session.stop() . Without it the generator never terminates and the request hangs. The other transcription surprise: browsers don't send WAV Inside the browser, MediaRecorder defaults to audio/webm; codecs=opus . That's great for size but bad for our STT model, which expects a 16-bit mono PCM WAV at a known sample rate. Decoding WebM/Opus server-side would require ffmpeg as a runtime dependency — which is exactly the kind of friction this project exists to remove. The cleaner solution is to encode WAV on the client. AudioContext.decodeAudioData already understands WebM/Opus, so the page can decode the recording, resample to 16 kHz, mix to mono, and emit a PCM-16 WAV blob in 30 lines of JavaScript: // Inside src/static/index.html async function webmToWav(blob) { const ctx = new (window.AudioContext || window.webkitAudioContext)({ sampleRate: 16000 }); const buf = await ctx.decodeAudioData(await blob.arrayBuffer()); // Mix to mono const ch = buf.numberOfChannels; const mono = new Float32Array(buf.length); for (let c = 0; c < ch; c++) { const data = buf.getChannelData(c); for (let i = 0; i < data.length; i++) mono[i] += data[i] / ch; } return encodeWav(mono, 16000); } function encodeWav(samples, sampleRate) { const buffer = new ArrayBuffer(44 + samples.length * 2); const view = new DataView(buffer); // RIFF header writeStr(view, 0, "RIFF"); view.setUint32(4, 36 + samples.length * 2, true); writeStr(view, 8, "WAVE"); // fmt chunk writeStr(view, 12, "fmt "); view.setUint32(16, 16, true); // PCM chunk size view.setUint16(20, 1, true); // PCM format view.setUint16(22, 1, true); // mono view.setUint32(24, sampleRate, true); view.setUint32(28, sampleRate * 2, true); // byte rate view.setUint16(32, 2, true); // block align view.setUint16(34, 16, true); // bits per sample // data chunk writeStr(view, 36, "data"); view.setUint32(40, samples.length * 2, true); // PCM-16 samples let o = 44; for (let i = 0; i < samples.length; i++, o += 2) { const s = Math.max(-1, Math.min(1, samples[i])); view.setInt16(o, s < 0 ? s * 0x8000 : s * 0x7FFF, true); } return new Blob([view], { type: "audio/wav" }); } Now the server's /api/transcribe endpoint just writes the bytes to a temp file and hands them to transcribe_wav_live() — no audio decoding libraries on the Python side. Wiring it into FastAPI The server ( src/app.py ) is deliberately small. The notable detail is that the same process holds both Foundry Local model handles for its entire lifetime, so there is no warm-up cost per request: @app.post("/api/transcribe") async def transcribe(audio: UploadFile = File(...)): data = await audio.read() with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f: f.write(data); path = f.name text = _ai_client.transcribe(path) return {"text": text} @app.post("/api/chat") async def chat(req: ChatRequest): if req.stream: return StreamingResponse( _sse(_ai_client.stream_completion(req.messages)), media_type="text/event-stream", ) return {"text": _ai_client.chat_completion(req.messages)} Streaming uses Server-Sent Events because they are trivially supported in both fetch() and the FastAPI runtime, and they don't require a WebSocket upgrade through any proxy a developer might have in front of localhost . What it looks like The repo includes screenshots of the running UI: a welcome screen with both models loaded, a streamed haiku reply, an inline code block with copy-to-clipboard, and the recording state for the microphone. Performance, honestly This is a small-model, CPU-friendly stack. On an Arm64 Surface running the x64 SDK under emulation: First model load (cold cache): tens of seconds — downloads ~600 MB for Nemotron and ~400 MB for qwen2.5-0.5b . Subsequent loads (warm cache): a few seconds per model. End-to-end transcription of a 5-second utterance: well under a second after warm-up. First chat token from qwen2.5-0.5b : typically 200–500 ms; full short reply within 1–2 s. On x64 silicon with a recent CPU the numbers improve substantially, and the SDK will pick the best execution provider it finds (CPU / DirectML / CUDA) for each model. Trade-offs to know about Model quality. qwen2.5-0.5b is a 500M-parameter model. It is fast and small enough to ship on a laptop, but it is not GPT-4. Swap in phi-4-mini or mistral-nemo-12b-instruct if you have the RAM and want better reasoning — the wrapper accepts any chat alias in the Foundry Local catalog. STT is English-only here. The current Nemotron streaming model in the catalog is ...-en-0.6b . Multilingual variants are likely to follow. Browser microphone needs a real browser. Headless / automated browsers (Playwright, Puppeteer) deny getUserMedia by default. Open the page in Edge / Chrome / Firefox to grant the permission and capture audio for real. No agent framework yet. This sample is deliberately a single-turn loop over a chat client — there is no tool calling, planning, or multi-agent orchestration. Adding the Microsoft Agent Framework on top would be a natural next step for richer behaviour. Responsible AI considerations Running locally removes the cloud-egress class of privacy concerns, but it does not remove responsibility: Disclose recording. The browser prompts for mic permission; your UI should make it obvious when capture is active. The sample shows a red ⏹ button and a "Recording…" banner for that reason. Don't log raw audio. The sample writes audio to a per-request NamedTemporaryFile and deletes it after transcription. Treat the WAV as sensitive data even when it never leaves the device. Small models hallucinate. A 0.5B chat model is great for snappy local replies, but unsuitable for high-stakes answers. Pair it with retrieval, ground it on your own data, or escalate to a larger model when accuracy matters. Try it Clone github.com/leestott/fl-nemotron. ./setup.ps1 (or ./setup.sh ) to create a virtualenv and install the pinned SDK. python scripts/prefetch.py nemotron-speech-streaming-en-0.6b qwen2.5-0.5b to download both models. .venv\Scripts\uvicorn.exe app:app --app-dir src --port 8000 Open http://127.0.0.1:8000 in a real browser and click the 🎤 button. Where to go next Foundry Local documentation — official docs for the runtime, catalog, and SDK. microsoft/Foundry-Local — upstream samples and issue tracker. NVIDIA Nemotron model family — background on the speech and language models being published into the catalog. leestott/fl-nemotron — the full source for this post. Key takeaways Pin foundry-local-sdk >= 1.1.0 . Earlier SDKs cannot see the Nemotron Speech Streaming model. Use the LiveAudioTranscriptionSession API for Nemotron, not AudioClient.transcribe() . Encode WAV in the browser. It eliminates a heavy server-side ffmpeg dependency for a few lines of JS. Push audio chunks on a worker thread and drain the response generator on the main one to avoid deadlocks. A small Foundry Local chat model plus Nemotron STT gives you a credible local voice loop in a single Python process — no cloud, no keys, no data egress.Fine-Tuning and Deploying Phi-3.5 Model with Azure and AI Toolkit
What is Phi-3.5? Phi-3.5 as a state-of-the-art language model with strong multilingual capabilities. Emphasize that it is designed to handle multiple languages with high proficiency, making it a versatile tool for Natural Language Processing (NLP) tasks across different linguistic backgrounds. Key Features of Phi-3.5 Highlight the core features of the Phi-3.5 model: Multilingual Capabilities: Explain that the model supports a wide variety of languages, including major world languages such as English, Spanish, Chinese, French, and others. You can provide an example of its ability to handle a sentence or document translation from one language to another without losing context or meaning. Fine-Tuning Ability: Discuss how the model can be fine-tuned for specific use cases. For instance, in a customer support setting, the Phi-3.5 model can be fine-tuned to understand the nuances of different languages used by customers across the globe, improving response accuracy. High Performance in NLP Tasks: Phi-3.5 is optimized for tasks like text classification, machine translation, summarization, and more. It has superior performance in handling large-scale datasets and producing coherent, contextually correct language outputs. Applications in Real-World Scenarios To make this section more engaging, provide a few real-world applications where the Phi-3.5 model can be utilized: Customer Support Chatbots: For companies with global customer bases, the model’s multilingual support can enhance chatbot capabilities, allowing for real-time responses in a customer’s native language, no matter where they are located. Content Creation for Global Markets: Discuss how businesses can use Phi-3.5 to automatically generate or translate content for different regions. For example, marketing copy can be adapted to fit cultural and linguistic nuances in multiple languages. Document Summarization Across Languages: Highlight how the model can be used to summarize long documents or articles written in one language and then translate the summary into another language, improving access to information for non-native speakers. Why Choose Phi-3.5 for Your Project? End this section by emphasizing why someone should use Phi-3.5: Versatility: It’s not limited to just one or two languages but performs well across many. Customization: The ability to fine-tune it for particular use cases or industries makes it highly adaptable. Ease of Deployment: With tools like Azure ML and Ollama, deploying Phi-3.5 in the cloud or locally is accessible even for smaller teams. Objective Of Blog Specialized Language Models (SLMs) are at the forefront of advancements in Natural Language Processing, offering fine-tuned, high-performance solutions for specific tasks and languages. Among these, the Phi-3.5 model has emerged as a powerful tool, excelling in its multilingual capabilities. Whether you're working with English, Spanish, Mandarin, or any other major world language, Phi-3.5 offers robust, reliable language processing that adapts to various real-world applications. This makes it an ideal choice for businesses looking to deploy multilingual chatbots, automate content generation, or translate customer interactions in real time. Moreover, its fine-tuning ability allows for customization, making Phi-3.5 versatile across industries and tasks. Customization and Fine-Tuning for Different Applications The Phi-3.5 model is not just limited to general language understanding tasks. It can be fine-tuned for specific applications, industries, and language models, allowing users to tailor its performance to meet their needs. Customizable for Industry-Specific Use Cases: With fine-tuning, the model can be trained further on domain-specific data to handle particular use cases like legal document translation, medical records analysis, or technical support. Example: A healthcare company can fine-tune Phi-3.5 to understand medical terminology in multiple languages, enabling it to assist in processing patient records or generating multilingual health reports. Adapting for Specialized Tasks: You can train Phi-3.5 to perform specialized tasks like sentiment analysis, text summarization, or named entity recognition in specific languages. Fine-tuning helps enhance the model's ability to handle unique text formats or requirements. Example: A marketing team can fine-tune the model to analyse customer feedback in different languages to identify trends or sentiment across various regions. The model can quickly classify feedback as positive, negative, or neutral, even in less widely spoken languages like Arabic or Korean. Applications in Real-World Scenarios To illustrate the versatility of Phi-3.5, here are some real-world applications where this model excels, demonstrating its multilingual capabilities and customization potential: Case Study 1: Multilingual Customer Support Chatbots Many global companies rely on chatbots to handle customer queries in real-time. With Phi-3.5’s multilingual abilities, businesses can deploy a single model that understands and responds in multiple languages, cutting down on the need to create language-specific chatbots. Example: A global airline can use Phi-3.5 to power its customer service bot. Passengers from different countries can inquire about their flight status or baggage policies in their native languages—whether it's Japanese, Hindi, or Portuguese—and the model responds accurately in the appropriate language. Case Study 2: Multilingual Content Generation Phi-3.5 is also useful for businesses that need to generate content in different languages. For example, marketing campaigns often require creating region-specific ads or blog posts in multiple languages. Phi-3.5 can help automate this process by generating localized content that is not just translated but adapted to fit the cultural context of the target audience. Example: An international cosmetics brand can use Phi-3.5 to automatically generate product descriptions for different regions. Instead of merely translating a product description from English to Spanish, the model can tailor the description to fit cultural expectations, using language that resonates with Spanish-speaking audiences. Case Study 3: Document Translation and Summarization Phi-3.5 can be used to translate or summarize complex documents across languages. Its ability to preserve meaning and context across languages makes it ideal for industries where accuracy is crucial, such as legal or academic fields. Example: A legal firm working on cross-border cases can use Phi-3.5 to translate contracts or legal briefs from German to English, ensuring the context and legal terminology are accurately preserved. It can also summarize lengthy documents in multiple languages, saving time for legal teams. Fine-Tuning Phi-3.5 Model Fine-tuning a language model like Phi-3.5 is a crucial step in adapting it to perform specific tasks or cater to specific domains. This section will walk through what fine-tuning is, its importance in NLP, and how to fine-tune the Phi-3.5 model using Azure Model Catalog for different languages and tasks. We'll also explore a code example and best practices for evaluating and validating the fine-tuned model. What is Fine-Tuning? Fine-tuning refers to the process of taking a pre-trained model and adapting it to a specific task or dataset by training it further on domain-specific data. In the context of NLP, fine-tuning is often required to ensure that the language model understands the nuances of a particular language, industry-specific terminology, or a specific use case. Why Fine-Tuning is Necessary Pre-trained Large Language Models (LLMs) are trained on diverse datasets and can handle various tasks like text summarization, generation, and question answering. However, they may not perform optimally in specialized domains without fine-tuning. The goal of fine-tuning is to enhance the model's performance on specific tasks by leveraging its prior knowledge while adapting it to new contexts. Challenges of Fine-Tuning Resource Intensiveness: Fine-tuning large models can be computationally expensive, requiring significant hardware resources. Storage Costs: Each fine-tuned model can be large, leading to increased storage needs when deploying multiple models for different tasks. LoRA and QLoRA To address these challenges, techniques like LoRA (Low-rank Adaptation) and QLoRA (Quantized Low-rank Adaptation) have emerged. Both methods aim to make the fine-tuning process more efficient: LoRA: This technique reduces the number of trainable parameters by introducing low-rank matrices into the model while keeping the original model weights frozen. This approach minimizes memory usage and speeds up the fine-tuning process. QLoRA: An enhancement of LoRA, QLoRA incorporates quantization techniques to further reduce memory requirements and increase the efficiency of the fine-tuning process. It allows for the deployment of large models on consumer hardware without the extensive resource demands typically associated with full fine-tuning. from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments from peft import get_peft_model, LoraConfig # Load a pre-trained model model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased") # Configure LoRA lora_config = LoraConfig( r=16, # Rank lora_alpha=32, lora_dropout=0.1, ) # Wrap the model with LoRA model = get_peft_model(model, lora_config) # Define training arguments training_args = TrainingArguments( output_dir="./results", evaluation_strategy="epoch", learning_rate=2e-5, per_device_train_batch_size=16, per_device_eval_batch_size=16, num_train_epochs=3, ) # Create a Trainer trainer = Trainer( model=model, args=training_args, train_dataset=train_dataset, eval_dataset=eval_dataset, ) # Start fine-tuning trainer.train() This code outlines how to set up a model for fine-tuning using LoRA, which can significantly reduce the resource requirements while still adapting the model effectively to specific tasks. In summary, fine-tuning with methods like LoRA and QLoRA is essential for optimizing pre-trained models for specific applications in NLP, making it feasible to deploy these powerful models in various domains efficiently. Why is Fine-Tuning Important in NLP? Task-Specific Performance: Fine-tuning helps improve performance for tasks like text classification, machine translation, or sentiment analysis in specific domains (e.g., legal, healthcare). Language-Specific Adaptation: Since models like Phi-3.5 are trained on general datasets, fine-tuning helps them handle industry-specific jargon or linguistic quirks. Efficient Resource Utilization: Instead of training a model from scratch, fine-tuning leverages pre-trained knowledge, saving computational resources and time. Steps to Fine-Tune Phi-3.5 in Azure AI Foundry Fine-tuning the Phi-3.5 model in Azure AI Foundry involves several key steps. Azure provides a user-friendly interface to streamline model customization, allowing you to quickly configure, train, and deploy models. Step 1: Setting Up the Environment in Azure AI Foundry Access Azure AI Foundry: Log in to Azure AI Foundry. If you don’t have an account, you can create one and set up a workspace. Create a New Experiment: Once in the Azure AI Foundry, create a new training experiment. Choose the Phi-3.5 model from the pre-trained models provided in the Azure Model Zoo. Set Up the Data for Fine-Tuning: Upload your custom dataset for fine-tuning. Ensure the dataset is in a compatible format (e.g., CSV, JSON). For instance, if you are fine-tuning the model for a customer service chatbot, you could upload customer queries in different languages. Step 2: Configure Fine-Tuning Settings Select the Training Dataset: Select the dataset you uploaded and link it to the Phi-3.5 model. 2) Configure the Hyperparameters: Set up training hyperparameters like the number of epochs, learning rate, and batch size. You may need to experiment with these settings to achieve optimal performance. 3) Choose the Task Type: Specify the task you are fine-tuning for, such as text classification, translation, or summarization. This helps Azure AI Foundry understand how to optimize the model during fine-tuning. 4) Fine-Tuning for Specific Languages: If you are fine-tuning for a specific language or multilingual tasks, ensure that the dataset is labeled appropriately and contains enough examples in the target language(s). This will allow Phi-3.5 to learn language-specific features effectively. Step 3: Train the Model Launch the Training Process: Once the configuration is complete, launch the training process in Azure AI Foundry. Depending on the size of your dataset and the complexity of the model, this could take some time. Monitor Training Progress: Use Azure AI Foundry’s built-in monitoring tools to track performance metrics such as loss, accuracy, and F1 score. You can view the model’s progress during training to ensure that it is learning effectively. Code Example: Fine-Tuning Phi-3.5 for a Specific Use Case Here's a code snippet for fine-tuning the Phi-3.5 model using Python and Azure AI Foundry SDK. In this example, we are fine-tuning the model for a customer support chatbot in multiple languages. from azure.ai import Foundry from azure.ai.model import Model # Initialize Azure AI Foundry foundry = Foundry() # Load the Phi-3.5 model model = Model.load("phi-3.5") # Set up the training dataset training_data = foundry.load_dataset("customer_queries_dataset") # Fine-tune the model model.fine_tune(training_data, epochs=5, learning_rate=0.001) # Save the fine-tuned model model.save("fine_tuned_phi_3.5") Best Practices for Evaluating and Validating Fine-Tuned Models Once the model is fine-tuned, it's essential to evaluate and validate its performance before deploying it in production. Split Data for Validation: Always split your dataset into training and validation sets. This ensures that the model is evaluated on unseen data to prevent overfitting. Evaluate Key Metrics: Measure performance using key metrics such as: Accuracy: The proportion of correct predictions. F1 Score: A measure of precision and recall. Confusion Matrix: Helps visualize true vs. false predictions for classification tasks. Cross-Language Validation: If the model is fine-tuned for multiple languages, test its performance across all supported languages to ensure consistency and accuracy. Test in Production-Like Environments: Before full deployment, test the fine-tuned model in a production-like environment to catch any potential issues. Continuous Monitoring and Re-Fine-Tuning: Once deployed, continuously monitor the model’s performance and re-fine-tune it periodically as new data becomes available. Deploying Phi-3.5 Model After fine-tuning the Phi-3.5 model, the next crucial step is deploying it to make it accessible for real-world applications. This section will cover two key deployment strategies: deploying in Azure for cloud-based scaling and reliability, and deploying locally with AI Toolkit for simpler offline usage. Each deployment strategy offers its own advantages depending on the use case. Deploying in Azure Azure provides a powerful environment for deploying machine learning models at scale, enabling organizations to deploy models like Phi-3.5 with high availability, scalability, and robust security features. Azure AI Foundry simplifies the entire deployment pipeline. Set Up Azure AI Foundry Workspace: Log in to Azure AI Foundry and navigate to the workspace where the Phi-3.5 model was fine-tuned. Go to the Deployments section and create a new deployment environment for the model. Choose Compute Resources: Compute Target: Select a compute target suitable for your deployment. For large-scale usage, it’s advisable to choose a GPU-based compute instance. Example: Choose an Azure Kubernetes Service (AKS) cluster for handling large-scale requests efficiently. Configure Scaling Options: Azure allows you to set up auto-scaling based on traffic. This ensures that the model can handle surges in demand without affecting performance. Model Deployment Configuration: Create an Inference Pipeline: In Azure AI Foundry, set up an inference pipeline for your model. Specify the Model: Link the fine-tuned Phi-3.5 model to the deployment pipeline. Deploy the Model: Select the option to deploy the model to the chosen compute resource. Test the Deployment: Once the model is deployed, test the endpoint by sending sample requests to verify the predictions. Configuration Steps (Compute, Resources, Scaling) During deployment, Azure AI Foundry allows you to configure essential aspects like compute type, resource allocation, and scaling options. Compute Type: Choose between CPU or GPU clusters depending on the computational intensity of the model. Resource Allocation: Define the minimum and maximum resources to be allocated for the deployment. For real-time applications, use Azure Kubernetes Service (AKS) for high availability. For batch inference, Azure Container Instances (ACI) is suitable. Auto-Scaling: Set up automatic scaling of the compute instances based on the number of requests. For example, configure the deployment to start with 1 node and scale to 10 nodes during peak usage. Cost Comparison: Phi-3.5 vs. Larger Language Models When comparing the costs of using Phi-3.5 with larger language models (LLMs), several factors come into play, including computational resources, pricing structures, and performance efficiency. Here’s a breakdown: Cost Efficiency Phi-3.5: Designed as a Small Language Model (SLM), Phi-3.5 is optimized for lower computational costs. It offers competitive performance at a fraction of the cost of larger models, making it suitable for budget-conscious projects. The smaller size (3.8 billion parameters) allows for reduced resource consumption during both training and inference. Larger Language Models (e.g., GPT-3.5): Typically require more computational resources, leading to higher operational costs. Larger models may incur additional costs for storage and processing power, especially in cloud environments. Performance vs. Cost Performance Parity: Phi-3.5 has been shown to achieve performance parity with larger models on various benchmarks, including language comprehension and reasoning tasks. This means that for many applications, Phi-3.5 can deliver similar results to larger models without the associated costs. Use Case Suitability: For simpler tasks or applications that do not require extensive factual knowledge, Phi-3.5 is often the more cost-effective choice. Larger models may still be preferred for complex tasks requiring deep contextual understanding or extensive factual recall. Pricing Structure Azure Pricing: Phi-3.5 is available through Azure with a pay-as-you-go billing model, allowing users to scale costs based on usage. Pricing details for Phi-3.5 can be found on the Azure pricing page, where users can customize options based on their needs. Code Example: API Setup and Endpoints for Live Interaction Below is a Python code snippet demonstrating how to interact with a deployed Phi-3.5 model via an API in Azure: import requests # Define the API endpoint and your API key api_url = "https://<your-azure-endpoint>/predict" api_key = "YOUR_API_KEY" # Prepare the input data input_data = { "text": "What are the benefits of renewable energy?" } # Make the API request response = requests.post(api_url, json=input_data, headers={"Authorization": f"Bearer {api_key}"}) # Print the model's response if response.status_code == 200: print("Model Response:", response.json()) else: print("Error:", response.status_code, response.text) Deploying Locally with AI Toolkit For developers who prefer to run models on their local machines, the AI Toolkit provides a convenient solution. The AI Toolkit is a lightweight platform that simplifies local deployment of AI models, allowing for offline usage, experimentation, and rapid prototyping. Deploying the Phi-3.5 model locally using the AI Toolkit is straightforward and can be used for personal projects, testing, or scenarios where cloud access is limited. Introduction to AI Toolkit The AI Toolkit is an easy-to-use platform for deploying language models locally without relying on cloud infrastructure. It supports a range of AI models and enables developers to work in a low-latency environment. Advantages of deploying locally with AI Toolkit: Offline Capability: No need for continuous internet access. Quick Experimentation: Rapid prototyping and testing without the delays of cloud deployments. Setup Guide: Installing and Running Phi-3.5 Locally Using AI Toolkit Install AI Toolkit: Go to the AI Toolkit website and download the platform for your operating system (Linux, macOS, or Windows). Install AI Toolkit by running the appropriate installation command in your terminal. Download the Phi-3.5 Model: Once AI Toolkit is installed, you can download the Phi-3.5 model locally by running: 3. Run the Model Locally: After downloading the model, start a local session by running: This will launch a local server on your machine where the model will be available for interaction. Code Example: Using Phi-3.5 Locally in a Project Below is a Python code example demonstrating how to send a query to the locally deployed Phi-3.5 model running on the AI Toolkit. import requests # Define the local endpoint local_url = "http://localhost:8000/predict" # Prepare the input data input_data = { "text": "What are the benefits of renewable energy?" } # Make the API request response = requests.post(local_url, json=input_data) # Print the model's response if response.status_code == 200: print("Model Response:", response.json()) else: print("Error:", response.status_code, response.text) Comparing Language Capabilities Test Results: How Phi-3.5 Handles Different Languages The Phi-3.5 model demonstrates robust multilingual capabilities, effectively processing and generating text in various languages. Below are comparative examples showcasing its performance in English, Spanish, and Mandarin: English Example: Input: "What are the benefits of renewable energy?" Output: "Renewable energy sources, such as solar and wind, reduce greenhouse gas emissions and promote sustainability." Spanish Example: Input: "¿Cuáles son los beneficios de la energía renovable?" Output: "Las fuentes de energía renovable, como la solar y la eólica, reducen las emisiones de gases de efecto invernadero y promueven la sostenibilidad." Mandarin Example: Input: "可再生能源的好处是什么?" Output: "可再生能源,如太阳能和风能,减少温室气体排放,促进可持续发展。" Performance Benchmarking and Evaluation Across Different Languages Benchmarking Phi-3.5 across different languages involves evaluating its accuracy, fluency, and contextual understanding. For instance, using BLEU scores and human evaluations, the model can be assessed on its translation quality and coherence in various languages. Real-World Use Case: Multilingual Customer Service Chatbot A practical application of Phi-3.5's multilingual capabilities is in developing a customer service chatbot that can interact with users in their preferred language. For instance, the chatbot could provide support in English, Spanish, and Mandarin, ensuring a wider reach and better user experience. Optimizing and Validating Phi-3.5 Model Model Performance Metrics To validate the model's performance in different scenarios, consider the following metrics: Accuracy: Measure how often the model's outputs are correct or align with expected results. Fluency: Assess the naturalness and readability of the generated text. Contextual Understanding: Evaluate how well the model understands and responds to context-specific queries. Tools to Use in Azure and Ollama for Evaluation Azure Cognitive Services: Utilize tools like Text Analytics and Translator to evaluate performance. Ollama: Use local testing environments to quickly iterate and validate model outputs. Conclusion In summary, Phi-3.5 exhibits impressive multilingual capabilities, effective deployment options, and robust performance metrics. Its ability to handle various languages makes it a versatile tool for natural language processing applications. Phi-3.5 stands out for its adaptability and performance in multilingual contexts, making it an excellent choice for future NLP projects, especially those requiring diverse language support. We encourage readers to experiment with the Phi-3.5 model using Azure AI Foundry or the AI Toolkit, explore fine-tuning techniques for their specific use cases, and share their findings with the community. For more information on optimized fine-tuning techniques, check out the Ignite Fine-Tuning Workshop. References Customize the Phi-3.5 family of models with LoRA fine-tuning in Azure Fine-tune Phi-3.5 models in Azure Fine Tuning with Azure AI Foundry and Microsoft Olive Hands on Labs and Workshop Customize a model with fine-tuning https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/fine-tuning?tabs=azure-openai%2Cturbo%2Cpython-new&pivots=programming-language-studio Microsoft AI Toolkit - AI Toolkit for VSCode1.8KViews1like2CommentsModel Mondays S2E01 Recap: Advanced Reasoning Session
About Model Mondays Want to know what Reasoning models are and how you can build advanced reasoning scenarios like a Deep Research agent using Azure AI Foundry? Check out this recap from Model Mondays Season 2 Ep 1. Model Mondays is a weekly series to help you build your model IQ in three steps: 1. Catch the 5-min Highlights on Monday, to get up to speed on model news 2. Catch the 15-min Spotlight on Monday, for a deep-dive into a model or tool 3. Catch the 30-min AMA on Friday, for a Q&A session with subject matter experts Want to follow along? Register Here- to watch upcoming livestreams for Season 2 Visit The Forum- to see the full AMA schedule for Season 2 Register Here - to join the AMA on Friday Jun 20 Spotlight On: Advanced Reasoning This week, the Model Mondays spotlight was on Advanced Reasoning with subject matter expert Marlene Mhangami. In this blog post, I'll talk about my five takeaways from this episode: Why Are Reasoning Models Important? What Is an Advanced Reasoning Scenario? How Can I Get Started with Reasoning Models ? Spotlight: My Aha Moment Highlights: What’s New in Azure AI 1. Why Are Reasoning Models Important? In today's fast-evolving AI landscape, it's no longer enough for models to just complete text or summarize content. We need AI that can: Understand multi-step tasks Make decisions based on logic Plan sequences of actions or queries Connect context across turns Reasoning models are large language models (LLMs) trained with reinforcement learning techniques to "think" before they answer. Rather than simply generating a response based on probability, these models follow an internal thought process producing a chain of reasoning before responding. This makes them ideal for complex problem-solving tasks. And they’re the foundation of building intelligent, context-aware agents. They enable next-gen AI workflows in everything from customer support to legal research and healthcare diagnostics. Reason: They allow AI to go beyond surface-level response and deliver solutions that reflect understanding, not just language patterning. 2. What does Advanced Reasoning involve? An advanced reasoning scenario is one where a model: Breaks a complex prompt into smaller steps Retrieves relevant external data Uses logic to connect dots Outputs a structured, reasoned answer Example: A user asks: What are the financial and operational risks of expanding a startup to Southeast Asia in 2025? This is the kind of question that requires extensive research and analysis. A reasoning model might tackle this by: Retrieving reports on Southeast Asia market conditions Breaking down risks into financial, political, and operational buckets Cross-referencing data with recent trends Returning a reasoned, multi-part answer 3. How Can I Get Started with Reasoning Models? To get started, you need to visit a catalog that has examples of these models. Try the GitHub Models Marketplace and look for the reasoning category in the filter. Try the Azure AI Foundry model catalog and look for reasoning models by name. Example: The o-series of models from Azure Open AI The DeepSeek-R1 models The Grok 3 models The Phi-4 reasoning models Next, you can use SDKs or Playground for exploring the model capabiliies. 1. Try Lab 331 - for a beginner-friendly guide. 2. Try Lab 333 - for an advanced project. 3. Try the GitHub Model Playground - to compare reasoning and GPT models. 4. Try the Deep Research Agent using LangChain - sample as a great starting project. Have questions or comments? Join the Friday AMA on Azure AI Foundry Discord: 4. Spotlight: My Aha Moment Before this session, I thought reasoning meant longer or more detailed responses. But this session helped me realize that reasoning means structured thinking — models now plan, retrieve, and respond with logic. This inspired me to think about building AI agents that go beyond chat and actually assist users like a teammate. It also made me want to dive deeper into LangChain + Azure AI workflows to build mini-agents for real-world use. 5. Highlights: What’s New in Azure AI Here’s what’s new in the Azure AI Foundry: Direct From Azure Models - Try hosted models like OpenAI GPT on PTU plans SORA Video Playground - Generate video from prompts via SORA models Grok 3 Models - Now available for secure, scalable LLM experiences DeepSeek R1-0528 - A reasoning-optimized, Microsoft-tuned open-source model These are all available in the Azure Model Catalog and can be tried with your Azure account. Did You Know? Your first step is to find the right model for your task. But what if you could have the model automatically selected for you_ based on the prompt you provide? That's the magic of Model Router a deployable AI chat model that dynamically selects the best LLM based on your prompt. Instead of choosing one model manually, the Router makes that choice in real time. Currently, this works with a fixed set of Azure OpenAI models, including a reasoning model option. Keep an eye on the documentation for more updates. Why it’s powerful: Saves cost by switching between models based on complexity Optimizes performance by selecting the right model for the task Lets you test and compare model outputs quickly Try it out in Azure AI Foundry or read more in the Model Catalog Coming Up Next Next week, we dive into Model Context Protocol, an open protocol that empowers agentic AI applications by making it easier to discover and integrate knowledge and action tools with your model choices. Register Here to get reminded - and join us live on Monday! Join The Community Great devs don't build alone! In a fast-pased developer ecosystem, there's no time to hunt for help. That's why we have the Azure AI Developer Community. Join us today and let's journey together! Join the Discord - for real-time chats, events & learning Explore the Forum - for AMA recaps, Q&A, and help! About Me. I'm Sharda, a Gold Microsoft Learn Student Ambassador interested in cloud and AI. Find me on Github, Dev.to,, Tech Community and Linkedin. In this blog series I have summarizef my takeaways from this week's Model Mondays livestream .521Views0likes0CommentsModel Mondays S2:E2 - Understanding Model Context Protocol (MCP)
This week in Model Mondays, we focus on the Model Context Protocol (MCP) — and learn how to securely connect AI models to real-world tools and services using MCP, Azure AI Foundry, and industry-standard authorization. Read on for my recap About Model Mondays Model Mondays is a weekly series designed to help you build your Azure AI Foundry Model IQ step by step. Here’s how it works: 5-Minute Highlights – Quick news and updates about Azure AI models and tools on Monday 15-Minute Spotlight – Deep dive into a key model, protocol, or feature on Monday 30-Minute AMA on Friday – Live Q&A with subject matter experts from Monday livestream If you want to grow your skills with the latest in AI model development, Model Mondays is the place to start. Want to follow along? Register Here - to watch upcoming Mondel Monday livestreams Watch Playlists to replay past Model Monday episodes Register Here - to join the AMA on MCP on Friday Jun 27 Visit The Forum- to view Foundry Friday AMAs and recaps Spotlight On: Model Context Protocol (MCP) This week, the Model Monday’s spotlight was on the Model Context Protocol (MCP) with subject matter expert Den Delimarsky. Don't forget to check out the slides from the presentation, for resource links! In this blog post, I’ll talk about my five key takeaways from this episode: What Is MCP and Why Does It Matter? What Is MCP Authorization and Why Is It Important? How Can I Get Started with MCP? Spotlight: My Aha Moment Highlights: What’s New in Azure AI 1 . What Is MCP and Why is it Important? MCP is a protocol that standardizes how AI applications connect the underlying AI models to required knowledge sources (data) and interaction APIs (functions) for more effective task execution. Because these models are pre-trained, they lack access to real-time or proprietary data sources (for knowledge) and real-world environments (for interaction). MCP allows them to "discover and use" relevant knowledge and action tools to add relevant context to the model for task execution. Explore: The MCP Specification Learn: MCP For Beginners Want to learn more about MCP - check out the AI Engineer World Fair 2025 "MCP and Keynotes" track. It kicks off with a keynote from Asha Sharma that gives you a broader vision for Azure AI Foundry. Then look for the talk from Harald Kirschner on MCP and VS Code. 2. What Is MCP Authorization and Why Does It Matter? MCP (Model Context Protocol) authorization is a system that helps developers manage who can access their apps, especially when they are hosted in the cloud. The goal is to simplify the process of securing these apps by using common tools like OAuth and identity providers (such as Google or GitHub), so developers don't have to be security experts. Key Takeaways: The new MCP proposal uses familiar identity providers to simplify the authorization process. It allows developers to secure their apps without requiring deep knowledge of security. The update ensures better security controls and prepares the system for future authentication methods. Related Reading: Aaron Parecki, Let's Fix OAuth in MCP Den Delimarsky, Improving The MCP Authorization Spec - One RFC At A Time MCP Specification, Authorization protocol draft On Monday, Den joined us live to talk about the work he did for the authorization protocol. Watch the session now to get a sense for what the MCP Authorization protocol does, how it works, and why it matters. Have questions? Submit them to the forum or Join the Foundry Friday AMA on Jun 27 at 1:30pm ET. 3. How Can I Get Started? If you want to start working with MCP, here’s how to do it easily: Learn the Fundamentals: Explore MCP For Beginners Use an MCP Server: Explore VSCode Agent Mode support . Use MCP with AI Agents: Explore the Azure MCP Server 4. What’s New in Azure AI Foundry? Managed Compute for Cohere Models: Faster, secure AI deployments with low latency. Prompt Shields: New Azure security system to protect against prompt injection and unsafe content. OpenAI o3 Pro Model: A fast, low-cost model similar to GPT-4 Turbo. Codex Mini Model: A smaller, quicker model perfect for developer command-line tasks. MCP Security Upgrades: Now easier to secure AI apps using familiar OAuth identity providers. 5. My Aha Moment Before this session, I used to think that connecting apps to AI was complicated and risky. I believed developers had to build their own security systems from scratch, which sounded tough. But this week, I learned that MCP makes it simple. We can now use trusted logins like Google or GitHub and securely connect AI models to real-world apps without extra hassle. How I Learned This ? To be honest, I also used Copilot to help me understand and summarize this topic in simple words. I wanted to make sure I really understood it well enough to explain it to my friends and peers. I believe in learning with the tools we have, and AI is one of them. By using Copilot and combining it with what I learned from the Model Monday’s session, I was able to write this blog in a way that is easy to understand Takeaway for Beginners: It’s okay to use AI to learn what matters is that you grow, verify, and share the knowledge in your own way. Coming Up Next Week: Next week, we dive into SLMs & Reasoning (Phi-4) with Mojan Javaheripi, PhD, Senior Researcher at Microsoft Research. This session will explore how Small Language Models (SLMs) can perform advanced reasoning tasks, and what makes models like Phi-4 reasoning efficient, scalable, and useful in practical AI applications. Register Here! Join The Community Great devs don't build alone! In a fast-pased developer ecosystem, there's no time to hunt for help. That's why we have the Azure AI Developer Community. Join us today and let's journey together! Join the Discord - for real-time chats, events & learning Explore the Forum - for AMA recaps, Q&A, and help! About Me: I'm Sharda, a Gold Microsoft Learn Student Ambassador interested in cloud and AI. Find me on Github, Dev.to, Tech Community and Linkedin. In this blog series I have summarized my takeaways from this week's Model Mondays livestream.1.1KViews1like2CommentsModel Mondays S2:E3 Understanding SLMs and Reasoning with Mojan Javaheripi
This week in Model Mondays, we focus on Small Language Models (SLMs) and Reasoning — and learn how reasoning models leverage inference-time scaling to execute complex tasks, but how can we use these in resource-constrained devices? Read on for my recap of Mojan Javaheripi's insights on Phi-4 reasoning models that are redefining small language models (SLM) for the agentic era of apps. About Model Mondays Model Mondays is a weekly series designed to help you build your Azure AI Foundry Model IQ step by step. Here's how it works: 5-Minute Highlights – Quick news and updates about Azure AI models and tools on Monday 15-Minute Spotlight – Deep dive into a key model, protocol, or feature on Monday 30-Minute AMA on Friday – Live Q&A with subject matter experts from Monday livestream If you want to grow your skills with the latest in AI model development, Model Mondays is the place to start. Want to follow along? Register Here - to watch upcoming Model Monday livestreams Watch Playlists to replay past Model Monday episodes Register Here- to join the AMA on SLMs and Reasoning on Friday Jul 03 Visit The Forum - to view Foundry Friday AMAs and recaps This post was generated with AI help and human revision & review. To learn more about our motivation and workflows, please refer to this document in our website. We are continuing to experiment with ideas here - feedback is welcome! Just drop us a comment and let us know! Spotlight On: SLMs and Reasoning Missed watching the livestream? Catch up on the episode below - and visit https://aka.ms/model-mondays/playlist to catch up on all the previous episodes in the series. And check out the Discussion Forum post here for all the resources and updates from the AMA and more. 1. What is this topic and why is it important? Small Language Models (SLMs) like Phi-4 represent a breakthrough in making advanced reasoning capabilities accessible on resource-constrained devices. While large language models require massive computational resources, SLMs can deliver sophisticated reasoning while running on edge devices, mobile phones, and local hardware. This is crucial because it democratizes AI access, reduces latency, and enables privacy-preserving applications where data doesn't need to leave the device. Reasoning models use inference-time scaling, meaning they can "think" longer about complex problems to arrive at better solutions. Phi-4 specifically excels at mathematical reasoning, code generation, and logical problem-solving while maintaining a smaller footprint than traditional large models. 2. What is one key takeaway from the episode? The key insight is that Phi-4 proves that model size doesn't always correlate with reasoning capability. Through advanced training techniques and architectural improvements, SLMs can achieve reasoning performance that rivals much larger models while being practical for deployment in real-world, resource-constrained environments. This opens up entirely new possibilities for agentic applications that can run locally and respond quickly. 3. How can I get started? To get started with SLMs and reasoning: 1. Explore Phi-4 on Azure AI Foundry Model Catalog 2. Try the reasoning capabilities in Azure AI Foundry Playground 3. Download and experiment with Phi-4 for local development 4. Check out the sample applications and use cases in the Azure AI Foundry documentation What's new in Azure AI Foundry? Azure AI Foundry continues to evolve to support the growing ecosystem of Small Language Models and agentic apps. Recently, new capabilities have been added to make it easier to fine-tune SLMs like Phi-4 directly in Azure AI Studio. Updates include: Enhanced Model Catalog: Easier discovery of SLMs, reasoning models, and multi-modal models. Improved Prompt Flow Integration: Now with templates specifically designed for SLM-based reasoning tasks. New Evaluation Tools: Built-in model comparison dashboards to quickly test reasoning performance across different SLM variants. Edge Deployment Support: Simplified workflows for packaging and deploying SLMs to local devices and edge environments. Want to get a summary of ALL the news from Jun 2025? Just visit this post on Azure AI Foundry Discussions Forum for all the links! My A-Ha Moment The biggest Aha moment for me was realizing that a model doesn’t need to be huge to be smart. Phi-4 proved that small models can actually handle complex reasoning tasks just like big models. What really clicked for me: You don’t need heavy GPUs or cloud servers. These models can run on mobile phones, edge devices, or small local machines. Your data stays on your device, which is great for privacy and faster responses. It’s a total game changer because now we can build intelligent apps that work even on low-resource devices. Coming up Next Week Next week, we dive into AI Developer Experiences with Leo Yao. We'll explore how to streamline the AI developer journey from model selection and usage to evaluation and app deployment using the AI Toolkit and Azure AI Foundry extensions for Visual Studio Code. Discover the key capabilities they provide for generative AI app & agent development. Register Here! to be notified - then watch live on YouTube below. Join The Community Great devs don't build alone! In a fast-pased developer ecosystem, there's no time to hunt for help. That's why we have the Azure AI Developer Community. Join us today and let's journey together! 1. Join the Discord - for real-time chats, events & learning 2. Explore the Forum - for AMA recaps, Q&A, and help! About Me: I'm Sharda, a Gold Microsoft Learn Student Ambassador interested in cloud and AI. Find me on Github, Dev.to, Tech Community and Linkedin. In this blog series I have summarized my takeaways from this week's Model Mondays livestream.309Views0likes0CommentsModel Mondays S2:E6 Understanding Research & Innovation with SeokJin Han and Saumil Shrivastava
In this week's blog post, we dive into the cutting-edge research happening at Azure AI Foundry Labs. From the MCP Server that makes it easy to experiment with new models and tools, to Magentic-UI that brings human-centered agent workflows to life, there’s a lot to unpack!244Views0likes0Comments