GitHub Action for Deploying Hosted Agents
Introduction

With Microsoft's introduction of Hosted Agents comes the next logical question: how do you implement this? Organizations need a method that is quick, repeatable, and requires minimal adjustments to their existing tooling and processes. Thus, we will walk through how to deploy a Hosted Agent through a repeatable GitHub Action. If this is new to you, this blog is a follow-up to Deploying Foundry Hosted Agents via REST API | Microsoft Community Hub.

Before You Start

This action assumes the following are already in place in the workflow that calls it:

- An existing Microsoft Foundry project with a deployed model.
- A container image already pushed to Azure Container Registry (ACR).
- An identity with the **Foundry User** role on the Foundry project. See [hosted agent permissions](https://learn.microsoft.com/en-us/azure/foundry/agents/concepts/hosted-agent-permissions) for the full permissions reference.
- A runner with `az`, `jq`, and `python3` installed. This is true on `ubuntu-latest`; if you self-host, install them explicitly.
- `azure/login` configured in the caller workflow **before** this action runs.

⚠️ **Identity prerequisite.** This action assumes `azure/login` has already run in the caller workflow and that the resulting identity holds a Foundry data-plane role (e.g., Foundry User). Both are needed: `az account get-access-token` fails when no login has run, and the data plane rejects the REST call when the role is missing.

Requirements

To meet our requirements (quick, repeatable, minimal adjustments to existing processes), we will use a GitHub Action and Bash. The Bash script takes a series of arguments that are used to call the REST API. The action requires four inputs: `project_endpoint`, `agent_name`, `image`, and `model_deployment_name`. The example pipeline wires these from the outputs of a preceding IaC step, but the action itself takes plain strings. These strings can come from any tool that can hand them off as workflow inputs.
This keeps it flexible and limits adjustments to existing CI/CD processes.

If interested, you can use the Azure Developer CLI (`azd up`) command, which is documented in Microsoft's official examples and on MS Learn. This blog does not cover it because the majority of enterprise customers already rely on tooling other than `azd`. You could also use the `azure.ai.projects` library to create an agent. This blog does not go down that route either, as not all organizations allow application code to create underlying compute infrastructure. Additionally, some organizations want teams other than developers to control and set the size of the Micro VM (referred to as the "sandbox" in the Foundry docs) that the Hosted Agent runs on. If your organization does not use GitHub Actions, this step should be reproducible in Azure DevOps using the Bash task.

Deployment Steps

To do this properly, let's take a step back and look at a CI/CD workflow for an agent whose definition is stored in a container. Ideally a pipeline should follow the steps outlined in CI/CD for AI Agents on Microsoft Foundry. Those pipelines typically take the shape build/push → IaC → update agent → smoke test. Since this post concentrates on Hosted Agent deployment via REST API, we will focus on the repeatable GitHub Action that deploys the agent. In that pipeline shape, it corresponds to the step called "Update agent — Foundry data plane POST `agents/NAME/versions`".

Depending on organizational preference, you may need to break the update-agent step out into a separate workflow. We traditionally don't recommend this, as keeping everything in one pipeline means one set of failures to triage, one history to read, and one CI/CD surface to keep current. This action, though, is structured to support a split if your release process requires it.
Hosted Agent REST Deployment Action This is the crux of why the article exists. If you've followed my style of repeatable DevOps process for YAML Pipelines, this action follows similar principles. We will parametrize with defaults to empower minimal configuration while also optimizing for flexibility. To view the full example check out the Update Foundry Agent action . The Inputs, Outputs, and `runs:` blocks shown below all live in a single file: `.github/actions/update-agent/action.yml`. Inputs Here are those parameters with descriptions and defaults: inputs: project_endpoint: description: Foundry project endpoint URL required: true agent_name: description: Name of the hosted agent required: true image: description: Full container image reference (registry/name:tag) required: true model_deployment_name: description: Name of the AI model deployment required: true cpu: description: CPU allocation for the agent container required: false default: '0.25' memory: description: Memory allocation for the agent container required: false default: '0.5Gi' Verify the latest sandbox sizes at hosted-agents#sandbox-sizes There is also guidance on right-sizing your Micro VMs. At the time of this writing here are the available combinations: Outputs We should output values that make sense for subsequent steps in the workflow. Every instance that calls this action may not use them, but it's always good to expose non-secret values just in case. In our case we are creating a new version of the agent, so let's output that agent version: outputs: agent_version: description: Version ID returned by the Foundry data plane value: ${{ steps.post.outputs.agent_version }} `agent_version` is the version identifier returned by the data plane. Capture this in your pipeline (artifact, release tag, etc.) so you have an audit trail and a target to re-deploy against if a future version needs to be rolled back. 
Subsequent steps in the workflow can reference it via `${{ steps.<step-id>.outputs.agent_version }}`. Action The action will need to map our environment variables being passed into the input as the first step. After that we will need to get an access token from Azure so we can then call the REST API endpoint. Once we have this, we will need to prepare the body of our call. Verify against the API for all valid properties. For our example I chose not to set `rai_config` (Responsible AI overview) and `tools` (function/tool bindings) to keep things simple. runs: using: composite steps: - name: Post agent version to Foundry data plane id: post shell: bash env: PROJECT_ENDPOINT: ${{ inputs.project_endpoint }} AGENT_NAME: ${{ inputs.agent_name }} IMAGE: ${{ inputs.image }} MODEL_DEPLOYMENT_NAME: ${{ inputs.model_deployment_name }} CPU: ${{ inputs.cpu }} MEMORY: ${{ inputs.memory }} run: | FOUNDRY_TOKEN=$(az account get-access-token \ --resource "https://ai.azure.com/" \ --query accessToken -o tsv) AGENT_REQUEST_BODY=$(jq -n \ --arg cpu "$CPU" \ --arg memory "$MEMORY" \ --arg model "$MODEL_DEPLOYMENT_NAME" \ --arg image "$IMAGE" \ '{ definition: { kind: "hosted", container_protocol_versions: [{protocol: "responses", version: "1.0.0"}], cpu: $cpu, memory: $memory, environment_variables: {AZURE_AI_MODEL_DEPLOYMENT_NAME: $model}, image: $image ⚠️ **Heads up on logs.** The line that echoes `HTTP ${HTTP_STATUS}: $(cat /tmp/agent_response.json)` dumps the full response body to the job log. If your request body contains sensitive `environment_variables`, the API may return them in the response, where they will appear in plain text in the workflow log. Either scrub the response before echoing, or echo only the `version` field on success. A 2xx response confirms the data plane accepted the new agent version. Confirming the agent behaves as intended is a separate step. This is done typically with a smoke test against the deployed agent in a later workflow job. 
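Returning to the log warning above, one way to avoid dumping the full response is to pass the saved body through a small filter and echo only a summary. The following is a minimal Python sketch, not the action's actual code. The function name `summarize_response` is hypothetical, and the response shape (a top-level `version` field plus an optional `definition.environment_variables` map) is an assumption that should be checked against a real response from the API.

```python
import json


def summarize_response(raw_body: str) -> str:
    """Return a log-safe, one-line summary of a data-plane response body.

    Assumed shape (verify against the real API): a JSON object with a
    top-level "version" field and, optionally, a map of environment
    variables under definition.environment_variables.
    """
    try:
        body = json.loads(raw_body)
    except json.JSONDecodeError:
        # Never echo an unparseable body verbatim; report only its length.
        return f"unparseable response ({len(raw_body)} characters)"

    if not isinstance(body, dict):
        return "unexpected response type: " + type(body).__name__

    definition = body.get("definition")
    env_vars = None
    if isinstance(definition, dict):
        env_vars = definition.get("environment_variables")

    # Keep the variable names for debugging, but drop every value.
    env_names = sorted(env_vars) if isinstance(env_vars, dict) else []

    return json.dumps(
        {"version": body.get("version"), "environment_variable_names": env_names}
    )
```

In a workflow step, the saved response file could be read and passed to this function, with only the returned summary written to the job log.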
If something goes wrong, the most common failures are:

- 401/403: `azure/login` didn't run, the identity is missing a Foundry data-plane role, or the wrong subscription is selected. Check the `azure/login` step and confirm the identity holds **Foundry User** (or higher) on the Foundry project (see the *Before You Start* callout above).
- 404: wrong `project_endpoint`, or the agent named in `agent_name` does not yet exist on the project. The agent must exist before posting a new version.
- 400: body or model issue, such as an invalid `cpu` / `memory` shape, a required field missing, or `model_deployment_name` pointing at a deployment that isn't reachable from this project.

Calling the Action

So now that we have the action, how can we scale this across multiple workflows? Simple: we just need to pass in the required parameters. Here is an example, with a stubbed `deploy-iac` step so that its outputs can be passed into the action as inputs:

    - name: Deploy Bicep infrastructure
      id: deploy-iac
      uses: ./.github/actions/deploy-bicep
      with:
        environment_name: ${{ inputs.environment_name || 'main' }}
        location: ${{ inputs.location || 'swedencentral' }}

    - name: Update agent
      uses: ./.github/actions/update-agent
      with:
        project_endpoint: ${{ steps.deploy-iac.outputs.project_endpoint }}
        agent_name: ${{ inputs.agent_name }}
        image: ${{ steps.deploy-iac.outputs.acr_endpoint }}/${{ inputs.image_name }}:${{ inputs.image_tag }}
        model_deployment_name: ${{ steps.deploy-iac.outputs.model_deployment_name }}

And just to show we can call the same action multiple times, here are two examples that do just that: Deploy (Bicep) and Deploy (Terraform).

Conclusion

The composite action shown above gives organizations what the introduction called for: a quick, repeatable way to deploy a Hosted Agent that requires minimal adjustments to the GitHub Actions tooling and processes already in use.
With it wired into a workflow, deploying a new Hosted Agent version becomes a standard step in your pipeline.

Infrastructure as Code for AI: Building and Deploying Microsoft Hosted Agents with Terraform
AI agents are no longer experimental. Teams are shipping production-grade agents that retrieve information, call APIs, reason over documents, and orchestrate multi-step workflows at scale. Microsoft Foundry's Hosted Agents service gives you a fully managed runtime for those agents, built on top of the Microsoft Foundry Agent Service, with Microsoft handling the infrastructure, scaling, and runtime lifecycle.

The challenge is that provisioning this infrastructure by hand (clicking through the portal, running one-off CLI commands, or relying on undocumented shell scripts) simply does not scale. It introduces configuration drift, makes reproducing environments painful, and creates real governance risk as teams grow.

This post walks through how to provision and manage the Azure infrastructure required to run Microsoft Hosted Agents using Terraform. You will leave with working configuration, a clear understanding of the resource model, and practical guidance on where Terraform can take you all the way and where you will need to supplement with the Azure CLI or the Microsoft Foundry Agent Service SDK.

What Are Microsoft Hosted Agents?

Microsoft Hosted Agents are AI agents deployed and managed within Microsoft Foundry. Microsoft Foundry is Microsoft's unified platform for building, evaluating, and deploying AI applications and agents. It provides:

- A managed compute runtime: Microsoft provisions and scales the infrastructure so you do not manage VMs or containers.
- An agent execution environment: agents are defined with instructions, tools (code interpreter, Bing grounding, Azure AI Search, function calling), and a backing model endpoint.
- Deep Azure integration: identity via Microsoft Entra ID, secrets via Azure Key Vault, storage via Azure Blob, tracing via Azure Monitor and Application Insights.
- A project-scoped model: each Microsoft Foundry project encapsulates an agent's resources, connections, and deployments within a logical boundary.
The "Hosted" distinction matters. You are not running agent code on your own Kubernetes cluster or App Service. Microsoft manages the runtime. Your responsibility is to provision the surrounding infrastructure correctly: the Microsoft Foundry resource, the project, the model deployment, the identity configuration, and the monitoring resources that back it all. That boundary — the infrastructure you own — is exactly what Terraform manages well. Why Terraform for Hosted Agent Deployments? Infrastructure as Code (IaC) is not a new idea, but its importance grows as AI deployments become more complex. Here is why Terraform is a strong choice for Microsoft Foundry deployments specifically: Repeatability: A Terraform configuration produces the same infrastructure every time. Staging mirrors production. Disaster recovery is a terraform apply away. Governance: Infrastructure definitions live in version control alongside application code. Changes are reviewable, auditable, and reversible. This satisfies most enterprise change-management requirements. Scale: Spinning up per-customer or per-team agent environments using Terraform workspaces or module instantiation is far more manageable than manual provisioning. State management: Terraform tracks the actual state of your Azure resources. It detects drift and reconciles it declaratively. Ecosystem: The AzureRM provider is mature, actively maintained by HashiCorp and Microsoft, and covers the majority of Azure services including the Microsoft Foundry resources. Architecture Overview Before writing any Terraform, it helps to understand the resource hierarchy in Microsoft Foundry and how each layer maps to an Azure resource type. The Foundry Resource Hierarchy Microsoft Foundry uses a two-level hierarchy: 1. Foundry Account ( azurerm_cognitive_account , kind: AIServices ) — The top-level AI Services resource. It provides the model endpoint, manages agent execution, and acts as the logical boundary for all projects beneath it. 
You must set project_management_enabled = true and provide a custom_subdomain_name to enable project creation. In ARM terms this is a Microsoft.CognitiveServices/accounts resource. 2. Foundry Project ( azurerm_cognitive_account_project ) — A child resource scoped within the Foundry Account. Each project has its own agents, model deployments, connections, and data assets. In production, you typically have one project per application, product team, or environment. Figure 1: The Microsoft Foundry resource hierarchy. A single Foundry Account (Cognitive Services, kind AIServices) acts as the top-level container, with Projects scoped beneath it — one per application, team, or environment. Supporting Resources The following Azure resources make up a complete Hosted Agents deployment: Microsoft Foundry Account (AI Services): A single azurerm_cognitive_account of kind AIServices serves as both the Foundry Account and the model endpoint host. Model deployments (e.g. gpt-4.1 ) are provisioned via azurerm_cognitive_deployment within this account. Log Analytics Workspace + Application Insights: Provides observability for agent traces, request logs, and metrics. User-Assigned Managed Identity: Grants the Foundry Account and Projects access to Azure resources without stored credentials. Role Assignments (RBAC): Wires the managed identity to the Foundry Account with least-privilege Cognitive Services permissions. Figure 2: Supporting infrastructure map. The managed identity holds least-privilege RBAC grants to the Microsoft Foundry Account (AI Services) — enabling model access and project management — all within the same resource group. Reference Architecture (Described) A production-ready layout separates concerns across two resource groups: one for shared infrastructure (networking, monitoring) and one for the Microsoft Foundry Account and its projects. 
The Foundry resource group houses the azurerm_cognitive_account (kind: AIServices) resource and the azurerm_cognitive_account_project instances. The shared resource group holds Log Analytics and Application Insights. A user-assigned managed identity spans both, holding RBAC grants to each backing service. For a dev/test environment you can collapse both into a single resource group. For production, the separation makes cost attribution, access control, and lifecycle management cleaner. Prerequisites Accounts and Permissions An active Azure subscription with the Owner or Contributor + User Access Administrator roles at the subscription or resource group level (role assignments require elevated permission). Foundry access enabled in your subscription. In some tenants you may need to accept terms or request quota for Azure OpenAI. Azure OpenAI quota for the model you intend to deploy (e.g. gpt-4.1 ). Request this via the Azure portal under Quotas in Azure OpenAI Studio. Local Tools Terraform CLI ≥ 1.9 — Install guide Azure CLI ≥ 2.60 — Install guide A code editor (VS Code with the HashiCorp Terraform extension and the Azure Terraform extension is a strong combination). Authentication For local development, authenticate via the Azure CLI. The AzureRM Terraform provider picks this up automatically: az login az account set --subscription "<your-subscription-id>" For CI/CD pipelines, use a service principal with AZURE_CLIENT_ID , AZURE_CLIENT_SECRET , AZURE_TENANT_ID , and AZURE_SUBSCRIPTION_ID environment variables, or — preferably — a workload identity federation (federated credentials) to avoid storing long-lived secrets. GitHub Actions supports OIDC-based workload identity natively. Terraform Fundamentals for Hosted Agents Provider Configuration The hashicorp/azurerm provider is your primary dependency. The new Microsoft Foundry resources ( azurerm_cognitive_account with kind = "AIServices" and azurerm_cognitive_account_project ) require version 4.x of the provider. 
Pin your version to avoid unexpected breaking changes: terraform { required_version = ">= 1.9" required_providers { azurerm = { source = "hashicorp/azurerm" version = "~> 4.0" } } } provider "azurerm" { features { key_vault { purge_soft_delete_on_destroy = false } resource_group { prevent_deletion_if_contains_resources = true } } subscription_id = var.subscription_id } The features block is required even when empty. The Key Vault setting prevents accidental secret loss during terraform destroy . The resource group setting adds an extra safety net in production. State Management Never use local state for shared or production environments. Store state in Azure Blob Storage with state locking via Azure Blob lease: terraform { backend "azurerm" { resource_group_name = "rg-terraform-state" storage_account_name = "sttfstate<unique>" container_name = "tfstate" key = "ai-agents/prod.tfstate" } } Create the state storage account and container before running terraform init . A bootstrap script or a separate Terraform workspace dedicated to state management are both valid approaches. Known Limitations and Workarounds Terraform coverage of Foundry is improving rapidly but is not yet complete. You should be aware of the following gaps as of mid-2025: Agent definitions are not in Terraform: The actual agent (its system prompt, instructions, tool configuration, and model binding) is created via the Azure AI Agent Service SDK or the Foundry portal, not via Terraform. Terraform provisions the infrastructure; your application code or a post-provisioning script creates the agent. Connections: Some connection types within a Foundry Project (e.g. Azure AI Search, custom connections) may require the Azure CLI or the Foundry SDK. Verify coverage in the AzureRM provider docs before assuming Terraform handles them. Model deployments: azurerm_cognitive_deployment covers OpenAI model deployments and is well-supported. Use this to deploy your model before referencing it from the agent. 
Private networking: If you need private endpoints for your Foundry Account, additional VNet, subnet, and DNS zone resources are required. This post focuses on the public networking path; private networking is a follow-on topic. Step-by-Step Implementation The following sections build up a complete Terraform configuration. The recommended project structure is a flat module layout for a single environment, with a separate modules/ai-foundry/ directory when you need to reuse the pattern across environments. ai-agents-infra/ ├── main.tf ├── variables.tf ├── outputs.tf ├── versions.tf └── terraform.tfvars 1. Variables Define variables first. Parameterising from the start avoids hard-coded values that create technical debt when you replicate the configuration for staging or production: # variables.tf variable "subscription_id" { type = string description = "Azure subscription ID." } variable "location" { type = string default = "eastus" description = "Azure region for all resources." } variable "environment" { type = string default = "dev" description = "Environment label (dev, staging, prod)." } variable "project_name" { type = string description = "Short name for the project. Used in resource naming." } variable "openai_model_name" { type = string default = "gpt-4.1" description = "Azure OpenAI model to deploy for the agent." } variable "openai_model_version" { type = string default = "2025-04-14" description = "Model version to deploy." } variable "openai_sku_capacity" { type = number default = 10 description = "Tokens-per-minute capacity (in thousands) for the deployment." } 2. Resource Group and Core Infrastructure A single resource group keeps things simple for dev. In production, consider splitting as described in the architecture section above. 
# main.tf — Resource group and naming locals locals { name_prefix = "${var.project_name}-${var.environment}" tags = { environment = var.environment project = var.project_name managed_by = "terraform" } } resource "azurerm_resource_group" "main" { name = "rg-${local.name_prefix}" location = var.location tags = local.tags } 3. Supporting Services Provision Log Analytics and Application Insights for agent observability and diagnostics. Unlike the legacy Hub-based architecture, the azurerm_cognitive_account (kind AIServices ) does not require a dedicated Storage Account or Key Vault as provisioning dependencies. # main.tf — Monitoring infrastructure data "azurerm_client_config" "current" {} # Log Analytics Workspace (required by Application Insights) resource "azurerm_log_analytics_workspace" "main" { name = "law-${local.name_prefix}" resource_group_name = azurerm_resource_group.main.name location = azurerm_resource_group.main.location sku = "PerGB2018" retention_in_days = 30 tags = local.tags } # Application Insights for agent observability resource "azurerm_application_insights" "main" { name = "appi-${local.name_prefix}" resource_group_name = azurerm_resource_group.main.name location = azurerm_resource_group.main.location workspace_id = azurerm_log_analytics_workspace.main.id application_type = "web" tags = local.tags } 4. User-Assigned Managed Identity A managed identity allows the Foundry Account and its projects to authenticate to Azure services without stored credentials. This is a security best practice and is required for several Microsoft Foundry features. # main.tf — Managed identity for the Microsoft Foundry Account resource "azurerm_user_assigned_identity" "foundry" { name = "id-${local.name_prefix}-foundry" resource_group_name = azurerm_resource_group.main.name location = azurerm_resource_group.main.location tags = local.tags } 5. 
Microsoft Foundry Account and Model Deployment In the current Microsoft Foundry architecture, a single azurerm_cognitive_account of kind AIServices serves as both the Foundry Account and the model endpoint host. Set project_management_enabled = true and provide a globally unique custom_subdomain_name to enable Foundry Project creation beneath it. # main.tf — Microsoft Foundry Account (AI Services) resource "azurerm_cognitive_account" "foundry" { name = "aisa-${local.name_prefix}" resource_group_name = azurerm_resource_group.main.name location = azurerm_resource_group.main.location kind = "AIServices" sku_name = "S0" project_management_enabled = true custom_subdomain_name = "${replace(local.name_prefix, "-", "")}foundry" tags = local.tags identity { type = "UserAssigned" identity_ids = [azurerm_user_assigned_identity.foundry.id] } } # Deploy the model within the Foundry Account resource "azurerm_cognitive_deployment" "agent_model" { name = var.openai_model_name cognitive_account_id = azurerm_cognitive_account.foundry.id model { format = "OpenAI" name = var.openai_model_name version = var.openai_model_version } sku { name = "Standard" capacity = var.openai_sku_capacity } } Note on quota: The capacity value is in thousands of tokens per minute. A value of 10 means 10,000 TPM. If terraform apply fails with a quota error, reduce this value or request a quota increase via the Azure portal. Note on custom_subdomain_name : This must be globally unique across all Azure AI Services accounts. If provisioning fails with a conflict error, adjust the suffix (e.g. append a random string using the random_string resource). 6. Foundry Project Create a Foundry Project beneath the Foundry Account provisioned in Step 5. Each project scopes its own agents, model connections, and data assets. Use one project per application or team. 
# main.tf — Microsoft Foundry Project resource "azurerm_cognitive_account_project" "agent_project" { name = "proj-${local.name_prefix}-agents" cognitive_account_id = azurerm_cognitive_account.foundry.id location = azurerm_resource_group.main.location display_name = "Agent Project - ${var.project_name}" description = "Hosted agents project for ${var.project_name}" identity { type = "UserAssigned" identity_ids = [azurerm_user_assigned_identity.foundry.id] } tags = local.tags } 7. RBAC Role Assignments Grant the managed identity the permissions it needs. This is the area most commonly misconfigured in manual deployments. Terraform makes it explicit and auditable. # main.tf — RBAC assignments # AI Services: Foundry identity needs Cognitive Services OpenAI User to call model endpoints resource "azurerm_role_assignment" "foundry_openai" { scope = azurerm_cognitive_account.foundry.id role_definition_name = "Cognitive Services OpenAI User" principal_id = azurerm_user_assigned_identity.foundry.principal_id } # AI Services: Foundry identity needs Cognitive Services Contributor to manage projects resource "azurerm_role_assignment" "foundry_contributor" { scope = azurerm_cognitive_account.foundry.id role_definition_name = "Cognitive Services Contributor" principal_id = azurerm_user_assigned_identity.foundry.principal_id } # Optional: grant your own principal the Azure AI Developer role on the Foundry Account # so you can create and manage agents from your local machine or CI pipeline resource "azurerm_role_assignment" "developer_account" { scope = azurerm_cognitive_account.foundry.id role_definition_name = "Azure AI Developer" principal_id = data.azurerm_client_config.current.object_id } 8. 
Outputs

Export the values your application and post-provisioning scripts will need:

    # outputs.tf
    output "resource_group_name" {
      value = azurerm_resource_group.main.name
    }

    output "foundry_account_id" {
      value = azurerm_cognitive_account.foundry.id
    }

    output "ai_foundry_project_id" {
      value = azurerm_cognitive_account_project.agent_project.id
    }

    output "foundry_endpoint" {
      value = azurerm_cognitive_account.foundry.endpoint
    }

    output "openai_deployment_name" {
      value = azurerm_cognitive_deployment.agent_model.name
    }

    output "managed_identity_client_id" {
      value = azurerm_user_assigned_identity.foundry.client_id
    }

9. Example terraform.tfvars

    # terraform.tfvars: do NOT commit this file if it contains sensitive values
    subscription_id      = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
    location             = "eastus"
    environment          = "dev"
    project_name         = "contoso-agents"
    openai_model_name    = "gpt-4.1"
    openai_model_version = "2025-04-14"
    openai_sku_capacity  = 10

Figure 3: Terraform deployment workflow. State is stored in an Azure Blob Storage backend, enabling team collaboration and preventing concurrent apply conflicts.

Deploying and Validating the Agent Infrastructure

Running the Deployment

    # 1. Initialise: downloads provider plugins and configures the backend
    terraform init

    # 2. Validate syntax and configuration
    terraform validate

    # 3. Preview what will be created (review carefully before applying)
    terraform plan -out=tfplan

    # 4. Apply the plan
    terraform apply tfplan

A full initial apply typically takes 8–15 minutes. The Foundry Account (AI Services) provisioning is the longest step. The model deployment may also take a few minutes to reach a ready state. Terraform handles this with implicit dependency ordering, but you may see brief retries in the output.
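Once the apply finishes, the outputs defined earlier can be handed to post-provisioning scripts. One way to do that, sketched here in Python under stated assumptions, is to parse the JSON printed by `terraform output -json`, in which each output is an object carrying `value` and `sensitive` keys. The helper names `flatten_outputs` and `read_terraform_outputs`, and the choice to upper-case output names, are illustrative and not part of the original configuration.

```python
import json
import subprocess


def flatten_outputs(raw_json: str) -> dict:
    """Map the text of `terraform output -json` to {OUTPUT_NAME: string value}.

    Outputs marked sensitive are skipped so they are not exported by accident.
    """
    outputs = json.loads(raw_json)
    flat = {}
    for name, meta in outputs.items():
        if meta.get("sensitive"):
            continue
        value = meta["value"]
        # Non-string values (numbers, lists, maps) are re-encoded as JSON text.
        flat[name.upper()] = value if isinstance(value, str) else json.dumps(value)
    return flat


def read_terraform_outputs(working_dir: str = ".") -> dict:
    """Run `terraform output -json` in working_dir and flatten the result."""
    result = subprocess.run(
        ["terraform", "output", "-json"],
        cwd=working_dir,
        check=True,
        capture_output=True,
        text=True,
    )
    return flatten_outputs(result.stdout)
```

Calling `read_terraform_outputs()` from the directory that holds the state would return entries such as `FOUNDRY_ENDPOINT` and `OPENAI_DEPLOYMENT_NAME`, which a later script can place into its environment.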
Verifying the Deployment

After apply completes, verify each resource is in a healthy state. The names below follow from the example terraform.tfvars (project_name "contoso-agents", environment "dev"), so the account is named aisa-contoso-agents-dev:

    # Confirm the resource group and its resources exist
    az resource list --resource-group "rg-contoso-agents-dev" --output table

    # Check the Foundry Account (AI Services) is in a Succeeded state
    az cognitiveservices account show \
      --name "aisa-contoso-agents-dev" \
      --resource-group "rg-contoso-agents-dev" \
      --query "properties.provisioningState"

    # Confirm the model deployment is ready
    az cognitiveservices account deployment show \
      --resource-group "rg-contoso-agents-dev" \
      --name "aisa-contoso-agents-dev" \
      --deployment-name "gpt-4.1" \
      --query "properties.provisioningState"

Navigate to the Microsoft Foundry portal and confirm your Foundry Account and Project appear. At this point you can create an agent manually in the portal to validate that the model endpoint is reachable and the identity chain works correctly before automating agent creation.

Common Deployment Issues

- Quota exceeded on model deployment: Reduce openai_sku_capacity or request a quota increase in the Azure portal under Azure OpenAI → Quotas.
- Resource name conflicts: The custom_subdomain_name on the Foundry Account must be globally unique. Use the random_string Terraform resource to append a unique suffix if needed.
- Role assignment propagation delay: RBAC changes can take 1–2 minutes to propagate. If the Foundry Account cannot access resources immediately after apply, wait a moment and retry.
- project_management_enabled not set: If azurerm_cognitive_account_project fails with an error about project management, ensure project_management_enabled = true and custom_subdomain_name are set on the parent azurerm_cognitive_account.
- azurerm_cognitive_account_project not found: Ensure your AzureRM provider version is ~> 4.0 or later. Run terraform init -upgrade if you previously initialised with an older version.

Creating an Agent After Infrastructure Provisioning

Terraform has provisioned the platform.
Now you need to create the agent itself. This is done via the Azure AI Agents SDK (available for Python, C#, JavaScript, and Java) or the Foundry portal. The following Python snippet demonstrates creating a basic agent programmatically after Terraform apply. It reads the project connection string and the model deployment name from environment variables, which you populate from the Terraform outputs:

    import os

    from azure.ai.projects import AIProjectClient
    from azure.identity import DefaultAzureCredential

    # These values come from Terraform outputs
    project_connection_string = os.environ["AI_PROJECT_CONNECTION_STRING"]
    model_deployment = os.environ["OPENAI_DEPLOYMENT_NAME"]

    # Note: from_connection_string is offered by the preview (beta) releases of
    # azure-ai-projects; check which constructor your installed version provides.
    client = AIProjectClient.from_connection_string(
        credential=DefaultAzureCredential(),
        conn_str=project_connection_string,
    )

    # Create the hosted agent
    agent = client.agents.create_agent(
        model=model_deployment,
        name="customer-support-agent",
        instructions=(
            "You are a helpful customer support assistant. "
            "Answer questions accurately and concisely. "
            "If you are unsure, say so rather than guessing."
        ),
    )

    print(f"Agent created: {agent.id}")

Figure 5: Agent runtime architecture. The Foundry Project hosts the Agent Service, which routes requests to the GPT-4.1 model endpoint and optionally invokes tool integrations (Code Interpreter, File Search, Azure Functions, or custom tools).

The project connection string is available from the Foundry portal (Project → Overview → Project connection string) or can be constructed from Terraform outputs. Refer to the Azure AI Agents quickstart for the full SDK setup.

Operational Considerations

Lifecycle Management

Terraform's declarative model means updates are incremental by default. To update the OpenAI model version, change openai_model_version in your .tfvars file and run terraform plan to confirm the change before applying. Depending on the provider version, Terraform either updates the cognitive deployment in place or destroys and recreates it; the plan output shows which. Be aware that a replacement causes brief downtime for the model endpoint.
To destroy a complete environment: terraform destroy The prevent_deletion_if_contains_resources feature on the resource group will block destruction if any untracked resources exist, which is a useful safety net in production. Handling Configuration Drift Drift occurs when Azure resources are modified outside of Terraform (portal changes, CLI scripts, other automation). Detect drift with: terraform plan -refresh-only This reports the difference between the Terraform state and the actual resource state without making changes. Schedule this as a drift-detection job in CI to catch out-of-band changes early. Environment Isolation Use Terraform workspaces or separate state files per environment: # Create and switch to a staging workspace terraform workspace new staging terraform workspace select staging terraform apply -var-file="environments/staging.tfvars" Alternatively, use a directory-per-environment layout ( environments/dev/ , environments/prod/ ) with a shared module in modules/ai-foundry/ . The directory layout is more explicit and easier to navigate in a team setting. Cost Control Set a low openai_sku_capacity in dev (e.g. 1 = 1,000 TPM) to limit accidental spend. Tag all resources with environment and project tags (the locals.tags block handles this) to enable cost attribution in Azure Cost Management. Use the Azure Pricing Calculator to estimate monthly costs before deploying to production. The Azure AI Services account (model token usage), Log Analytics, and Application Insights are the primary cost drivers. Consider destroying dev environments overnight using a scheduled CI job that runs terraform destroy and terraform apply on a schedule. CI/CD Integration Automating Terraform via GitHub Actions is straightforward. 
The following workflow runs plan on pull requests and apply on merge to the main branch: # .github/workflows/terraform.yml name: Terraform Deploy on: push: branches: [main] pull_request: branches: [main] permissions: id-token: write # Required for OIDC workload identity federation contents: read pull-requests: write env: ARM_CLIENT_ID: ${{ secrets.AZURE_CLIENT_ID }} ARM_TENANT_ID: ${{ secrets.AZURE_TENANT_ID }} ARM_SUBSCRIPTION_ID: ${{ secrets.AZURE_SUBSCRIPTION_ID }} ARM_USE_OIDC: "true" jobs: terraform: runs-on: ubuntu-latest environment: ${{ github.ref == 'refs/heads/main' && 'production' || 'staging' }} steps: - uses: actions/checkout@v4 - uses: hashicorp/setup-terraform@v3 with: terraform_version: "~1.9" - name: Terraform Init run: terraform init - name: Terraform Plan run: terraform plan -out=tfplan -var-file="environments/dev.tfvars" - name: Terraform Apply if: github.ref == 'refs/heads/main' run: terraform apply -auto-approve tfplan Figure 4: CI/CD pipeline using GitHub Actions with OIDC workload identity federation. No long-lived secrets are stored — the runner exchanges a JWT for a short-lived Azure token before each Terraform run. Use OIDC workload identity federation to avoid storing long-lived service principal secrets in GitHub. This is the recommended authentication method for GitHub Actions deployments to Azure. Best Practices Modular Terraform Design Once you have a working flat configuration, extract the Foundry resources into a reusable module. A module boundary around the Hub, Project, OpenAI account, and RBAC assignments lets you stamp out new agent environments with a single module call and a new .tfvars file. 
# environments/staging/main.tf
module "agent_platform" {
  source               = "../../modules/ai-foundry"
  project_name         = "contoso-agents"
  environment          = "staging"
  location             = "eastus"
  subscription_id      = var.subscription_id
  openai_model_name    = "gpt-4.1"
  openai_model_version = "2025-04-14"
  openai_sku_capacity  = 30
}

Parameterisation and Environment Configs
Never hard-code subscription IDs, tenant IDs, or region names in main.tf. Keep environment-specific values in environments/<env>.tfvars files and commit them to source control (they are config, not secrets). Store actual secrets (service principal credentials, API keys for third-party connections) in Azure Key Vault or GitHub Secrets, not in .tfvars files.

Versioning Models and Agent Configurations
Treat your openai_model_version and agent instructions as versioned artefacts. When Microsoft releases a new model version, create a pull request that updates the variable value, runs a plan, and documents the expected change. This creates a clear history of when model versions changed and who approved the change.

Logging and Monitoring
Enable diagnostic settings on the Azure OpenAI account to route request logs and metrics to your Log Analytics workspace. Use Application Insights to capture agent traces from the Azure AI Agents SDK (it integrates with OpenTelemetry). Set up Azure Monitor alerts on OpenAI account errors (4xx/5xx rates) and Log Analytics ingestion failures.

Responsible AI Considerations
Enable Azure OpenAI content filtering on your deployment. Terraform can attach a content-filtering (RAI) policy to a deployment through the rai_policy_name argument on azurerm_cognitive_deployment; custom policies can be defined with the azurerm_cognitive_account_rai_policy resource where your provider version supports it.
Define a clear system prompt that sets agent behaviour boundaries and instructs the agent to decline harmful requests.
Log and review agent conversations during early deployment. Microsoft Foundry includes evaluation tools for assessing agent response quality and safety.
Apply least-privilege RBAC throughout; the role assignments in this post follow that principle.
Conclusion and Next Steps You now have a complete, repeatable Terraform configuration for provisioning the Azure infrastructure required to run Microsoft Hosted Agents via Microsoft Foundry. The key takeaways: Terraform manages the infrastructure layer effectively — the Foundry Account, Project, model deployment, identity, and RBAC. Agent definitions themselves are provisioned via the Azure AI Agents SDK or the Foundry portal as a post-Terraform step. State management, parameterisation, and modular design are non-negotiable for team environments. OIDC-based workload identity is the right authentication model for CI/CD pipelines. Drift detection, environment isolation, and cost tagging are operational necessities, not optional extras. Where to Go Next Add Azure AI Search: Extend the Foundry Project with an Azure AI Search connection and enable the Search tool on your agent for Retrieval-Augmented Generation (RAG). Private networking: Add private endpoints for the Foundry Hub and OpenAI account to lock down ingress to your VNet. Multi-region deployment: Instantiate the Terraform module twice with different regions and use Azure Traffic Manager or Front Door to route requests. GitOps for agents: Store agent definitions (system prompts, tool configurations) as YAML or JSON in your repository and use a CI pipeline to apply them via the Azure AI Agents SDK on every merge, creating a fully declarative agent deployment pipeline. Evaluation pipelines: Use Microsoft Foundry's built-in evaluation capabilities to run automated quality and safety assessments on every new model version or prompt change. References What is Microsoft Foundry? 
(Microsoft Learn)
Azure AI Agent Service overview (Microsoft Learn)
Azure AI Agents quickstart (Microsoft Learn)
azurerm_cognitive_account (Terraform Registry)
azurerm_cognitive_account_project (Terraform Registry)
azurerm_cognitive_deployment (Terraform Registry)
AzureRM backend (Terraform documentation)
OIDC workload identity federation with GitHub Actions (Microsoft Learn)
Azure OpenAI content filtering (Microsoft Learn)
Install Terraform (HashiCorp)
Microsoft Foundry portal

From Requirement to Production Code, How Engineering Squad Automates the Full Dev Lifecycle
I started wondering: what if, instead of one AI assistant generating code snippets, you had an entire squad of specialized AI agents, each owning a single stage of the delivery pipeline, that could collaborate, self-correct, and produce a complete, traceable output from a plain-text requirement? That's Engineering Squad: an open-source, multi-agent framework built with LangGraph, Azure OpenAI, and Foundry Local. Nine agents. One pipeline. Zero manual handoffs.

You give it a requirement. It gives you back:
- User stories with acceptance criteria
- Technical design (API contracts, data models, architecture)
- Full implementation code (written into real files, not markdown)
- Unit tests and Playwright E2E tests
- Automated code review with a self-correcting feedback loop

When the Code Reviewer finds a bug, it doesn't just flag it; it routes the work back to the exact agent that needs to fix it. When the Spec Agent hits ambiguity, it stops and asks you rather than guessing. The loop runs up to 5 iterations, and every run is versioned under a unique Run ID for full traceability. It runs on Azure OpenAI for heavy reasoning and Foundry Local for lightweight tasks, or entirely offline with --local-only mode. No cloud required.

How It Works
The squad is a directed graph of 9 specialized agents. Each agent has a single responsibility and a tuned system prompt. The orchestration is handled by LangGraph's StateGraph, which routes work through the pipeline and handles feedback loops.
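To make the "route back, but stop after 5 iterations" idea concrete, here is a minimal plain-Python sketch. It is a stand-in for the concept only, not LangGraph's API and not the framework's actual code: the run_review_loop function and the fake_review callable are hypothetical, while the decision labels and target agent names are borrowed from the reviewer model shown in this post.

```python
from typing import Callable, Optional

MAX_ITERATIONS = 5  # hard stop, mirroring the cap described in this post

# Reviewer decision -> agent that should act next (None means "done").
ROUTES: dict[str, Optional[str]] = {
    "approved": None,
    "requirement_confusion": "spec_agent",
    "clarity_missing": "technical_design",
    "code_missing": "developer",
    "bug_found": "developer",
    "test_case_missing": "test_writer",
}


def run_review_loop(review: Callable[[int], str]) -> tuple[str, list[str]]:
    """Call the reviewer until it approves or the iteration cap is reached.

    `review` is a hypothetical callable: iteration number -> decision string.
    Returns the final status and the agents that work was routed back to.
    """
    visited: list[str] = []
    for iteration in range(1, MAX_ITERATIONS + 1):
        target = ROUTES[review(iteration)]
        if target is None:
            return "approved", visited
        visited.append(target)  # a real graph would re-run this agent here
    return "max_iterations_reached", visited


def fake_review(iteration: int) -> str:
    """Scripted reviewer: two rounds of feedback, then approval."""
    return ["bug_found", "test_case_missing", "approved"][iteration - 1]


status, visited = run_review_loop(fake_review)
print(status, visited)
```

Keeping the decision-to-agent mapping in one table makes the cap and the routing easy to audit, which is the same property the graph-based version aims for.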
The Agents Agent Model Responsibility Product Owner Azure OpenAI gpt-4.1 Reads requirements, classifies impact scope Story Agent Foundry Local (qwen2.5-7b) Converts requirements → structured user stories Spec Agent Azure OpenAI o3 Resolves ambiguity — asks the user interactively Technical Design Azure OpenAI gpt-4.1 Architecture, API contracts, data models, error handling Developer Azure OpenAI gpt-4.1 Writes code directly into the codebase Unit Tester Azure OpenAI gpt-4.1 Writes unit tests and evaluates them against implementation Test Writer Foundry Local (qwen2.5-7b) Writes Playwright E2E tests using Page Object Model Tester Azure OpenAI o3 Final evaluation of code against all specs and tests Code Reviewer Azure OpenAI o3 Reviews everything, decides: approve or route back The Self-Correcting Loop This is where it gets interesting. The Code Reviewer doesn't just say "approved" or "rejected" — it makes a routing decision using structured output: class ReviewDecision(BaseModel): decision: Literal[ "approved", # Ship it "requirement_confusion", # → Spec Agent (clarify ambiguity) "clarity_missing", # → Technical Design (refine design) "code_missing", # → Developer (fix implementation) "bug_found", # → Developer (fix bugs) "test_case_missing", # → Test Writer (add coverage) ] feedback: str # Actionable feedback for the target agent LangGraph's conditional edges route the workflow back to the exact agent that needs to act. The loop runs up to 5 iterations with a hard stop to prevent infinite cycles. workflow.add_conditional_edges( "code_reviewer", route_review, { END: END, "spec_agent": "spec_agent", "technical_design": "technical_design", "developer": "developer", "test_writer": "test_writer", }, ) Key Design Decisions 1. Impact Classification — Don't Run What You Don't Need Not every change needs the full pipeline. 
The squad classifies scope first: Scope What Runs config Impact Analysis → Developer → Unit Tester → Reviewer bugfix Impact Analysis → Developer → Unit Tester → Tester → Reviewer enhancement Stories → Design (if needed) → Developer → All Tests → Reviewer feature Stories → Design → Developer → All Tests → Reviewer refactor Impact Analysis → Developer → Unit Tester → Reviewer A config change doesn't need user stories. A bugfix doesn't need a full architectural design. This keeps runs fast and focused. 2. Code Goes Into Real Files, Not Markdown This was a deliberate choice. The Developer Agent edits actual source files in your project — it doesn't dump code into a markdown artifact. The code_changes.md artifact is a change log that records what was modified and why, for traceability. 3. Existing Projects vs. Greenfield Set PROJECT_TYPE: existing in requirements_input.txt, point it at your repos, and the squad will: Scan your codebase for patterns, conventions, and architecture Make targeted changes only — no rewriting from scratch Preserve your existing coding style, error handling, and naming conventions 4. Two LLM Tiers — Cloud + Local The framework uses a hybrid model strategy: Azure OpenAI (gpt-4.1, o3) for complex reasoning: code generation, technical design, code review Foundry Local (qwen2.5-7b, phi-3.5-mini) for lightweight tasks: user stories, test writing This keeps costs down while maintaining quality where it matters. And with --local-only mode, you can run the entire squad on Foundry Local with zero cloud dependencies. Running It Locally with Foundry Local One of my favorite features: the entire squad can run 100% locally using Foundry Local. No Azure subscription, no API keys, no internet required. 
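The hybrid tiering and the local-only switch described above can be pictured as a simple lookup. This is an illustrative sketch, not the framework's code: the dictionary keys, the pick_model helper, and the choice of qwen2.5-7b as the local replacement for cloud models are assumptions; the agent-to-model pairs come from the agents table in this post.

```python
# Agent -> (provider, model), taken from the agents table in this post.
# The snake_case keys are hypothetical identifiers for illustration.
AGENT_MODELS = {
    "product_owner": ("azure", "gpt-4.1"),
    "story_agent": ("local", "qwen2.5-7b"),
    "spec_agent": ("azure", "o3"),
    "technical_design": ("azure", "gpt-4.1"),
    "developer": ("azure", "gpt-4.1"),
    "unit_tester": ("azure", "gpt-4.1"),
    "test_writer": ("local", "qwen2.5-7b"),
    "tester": ("azure", "o3"),
    "code_reviewer": ("azure", "o3"),
}

# Assumption: in local-only mode every cloud agent falls back to this model.
LOCAL_FALLBACK = "qwen2.5-7b"


def pick_model(agent: str, local_only: bool = False) -> tuple[str, str]:
    """Return (provider, model) for an agent, honouring local-only mode."""
    provider, model = AGENT_MODELS[agent]
    if local_only and provider == "azure":
        return ("local", LOCAL_FALLBACK)
    return (provider, model)


print(pick_model("developer"))
print(pick_model("developer", local_only=True))
```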
Setup # Install Foundry Local CLI (one-time) winget install Microsoft.FoundryLocal # Install Python dependencies pip install foundry-local-sdk openai langchain-openai langgraph python-dotenv # Run in local-only mode python main.py --local-only When --local-only is set, every agent that would normally call Azure OpenAI gets redirected to Foundry Local: def get_azure_llm(deployment: str, temperature: float = 0.1): # Local-only mode: redirect to Foundry Local if os.getenv("SQUAD_LOCAL_ONLY", "").lower() in ("true", "1", "yes"): from models.local_llm import get_local_llm return get_local_llm(temperature=temperature) # Otherwise: use Azure OpenAI with DefaultAzureCredential ... The foundry-local-sdk (v1.1.0+) handles everything — initializing the runtime, downloading models, and loading them: from foundry_local_sdk import FoundryLocalManager, Configuration # Initialize once (singleton) config = Configuration(app_name="my-app") manager = FoundryLocalManager(config) # Start OpenAI-compatible web service manager.start_web_service() print(manager.urls[0]) # SDK auto-discovers the endpoint # Download & load a model model = manager.catalog.get_model("qwen2.5-7b") model.download() model.load() # Chat directly — no web service needed chat = model.get_chat_client() response = chat.complete_chat([{"role": "user", "content": "Hello!"}]) Jupyter Notebook The repo includes a Jupyter notebook (foundry_local.ipynb) that walks you through: Installing Foundry Local Loading a model Sending chat completions (streaming and non-streaming) Running the full Engineering Squad in local-only mode Traceability — Every Run Is Versioned Every squad execution gets a unique Run ID and produces a structured artifact set: output/ runs/ 20260524_a3f9b1/ run_metadata.json ← run ID, timestamp, requirement hash, decision impact_classification.md user_stories.md technical_design.md code_changes.md ← change log (code is in real files) unit_test_results.md tests.md test_results.md review_feedback.md latest/ ← 
symlink to most recent approved run The run_metadata.json is structured for future Azure DevOps integration — auto-creating work items, tasks, and test cases from squad output. Two Ways to Run Mode Best For GitHub Copilot Agent Mode Existing codebases — Copilot has full workspace context via #codebase Python CLI (python main.py) New projects, CI pipelines, fully automated runs Running with GitHub Copilot Agent Mode This is the recommended way to run the squad on existing projects. Copilot has full access to your workspace — it can read files, write code, and run terminal commands — so it naturally understands your architecture, patterns, and conventions. Prerequisites VS Code with the GitHub Copilot and GitHub Copilot Chat extensions installed A Copilot subscription that supports Agent Mode (Copilot Pro, Business, or Enterprise) Setup Clone the repo and open it in VS Code: git clone https://github.com/prasunagga/engineeringSquad.git code engineeringSquad Switch to Agent Mode — In the Copilot Chat panel, click the mode dropdown (top of the chat input) and select "Agent". This is required — Ask and Edit modes don't have tool access. Enable tools — Click the 🔧 tools icon (or gear/settings icon) at the bottom of the chat input area. Make sure the following tools are enabled: File operations (read, create, edit files) Terminal (run commands) Code search / workspace context Without these enabled, the squad can't read your codebase or write code into files. Edit your requirement — Open requirements_input.txt and write your requirement: PROJECT_TYPE: existing FRONTEND_PATH: plant-catalog BACKEND_PATH: Build a cart page where users can add plants, adjust quantities, and see totals. 
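For readers curious how a file like this could be consumed, below is a minimal parsing sketch. It assumes each KEY: value header sits on its own line and the free-text requirement follows; the real layout of requirements_input.txt and the framework's actual parser may differ, and parse_requirement is a hypothetical helper.

```python
def parse_requirement(text: str) -> tuple[dict[str, str], str]:
    """Split leading UPPER_CASE 'KEY: value' lines from the requirement text."""
    headers: dict[str, str] = {}
    body_lines: list[str] = []
    in_headers = True
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        key = key.strip()
        if in_headers and sep and key.isupper() and " " not in key:
            headers[key] = value.strip()
        else:
            if line.strip():
                in_headers = False  # first non-blank free-text line ends headers
            body_lines.append(line)
    return headers, "\n".join(body_lines).strip()


sample = """PROJECT_TYPE: existing
FRONTEND_PATH: plant-catalog
BACKEND_PATH:

Build a cart page where users can add plants, adjust quantities, and see totals.
"""

headers, requirement = parse_requirement(sample)
print(headers)
print(requirement)
```

An empty BACKEND_PATH is kept as an empty string here, so the caller can decide whether that means "no backend" or "not yet specified".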
Running the Squad In Copilot Chat (Agent Mode), type: /run-squad This triggers the .github/prompts/run-squad.prompt.md file — a prompt file with mode: agent in its YAML frontmatter that orchestrates the full workflow: --- mode: agent description: Run the full Engineering Squad workflow tools: - read_file - create_file - replace_in_file - insert_text - delete_file_range --- Copilot will then execute the full pipeline: read requirements → classify impact → generate stories → design → write code → write tests → run tests → code review → approve or loop back. How It Differs from Python CLI Copilot Agent Mode Python CLI Context Full workspace awareness via #codebase Reads files from paths in requirements_input.txt Human-in-loop Spec Agent asks you directly in chat Spec Agent prints questions to stdout Code editing Uses VS Code's file editing tools Writes files via Python open() Test execution Runs npm test / playwright test in VS Code terminal Runs via subprocess Model Uses whichever model is selected in Copilot Uses Azure OpenAI / Foundry Local Individual Agent Prompts The .github/prompts/ directory also contains standalone prompt files for running individual agents: Prompt Purpose run-squad.prompt.md Full orchestrated pipeline developer.prompt.md Developer agent only code-reviewer.prompt.md Code review only story-agent.prompt.md Generate user stories only technical-design.prompt.md Technical design only test-writer.prompt.md Write E2E tests only Extending the Framework The squad is designed to be modular. 
Here are the most common extension points: Add a New Agent Every agent follows the same pattern — a function that takes SquadState, calls an LLM, and returns updated fields: # agents/my_agent.py from langchain_core.prompts import ChatPromptTemplate from graph.state import SquadState from models.azure_llm import get_azure_llm, DEPLOYMENT_DEVELOPER PROMPT = ChatPromptTemplate.from_messages([ ("system", "You are a security review specialist."), ("human", "Review this code for vulnerabilities:\n{code}"), ]) def my_agent_node(state: SquadState) -> dict: llm = get_azure_llm(deployment=DEPLOYMENT_DEVELOPER) result = (PROMPT | llm).invoke({"code": state["code"]}) return {"security_review": result.content} Then wire it in: Add state fields in graph/state.py Register the node and edges in graph/workflow.py Add artifact output in main.py Swap the LLM for Any Agent Each agent calls get_azure_llm(deployment=...) or get_local_llm(). You can: Change the model — edit .env (e.g., AZURE_DEPLOYMENT_DEVELOPER=gpt-5.4) Go fully local — python main.py --local-only Use a different provider — replace get_azure_llm() with any LangChain-compatible LLM (Anthropic, Ollama, Groq, etc.) Customize Agent Prompts Each agent's system prompt is defined as a ChatPromptTemplate at the top of its file in agents/. Edit the prompt directly — no configuration layer to navigate. Change the Review Loop The routing logic lives in graph/workflow.py → route_review(). Add new decision strings, change the routing map, or adjust MAX_ITERATIONS (default: 5). VS Code Copilot Agent Mode The .github/prompts/ directory contains prompt files for running individual agents in VS Code Copilot Agent Mode. Edit these to customize agent behavior when running through Copilot. What I Learned Building This Structured output is essential for routing. Without Pydantic models for review decisions, the conditional edge routing would be fragile and string-matching-dependent. Impact classification saves significant time. 
Running 9 agents for a one-line config change is wasteful. Classifying scope first makes the system practical.
The self-correcting loop works, but needs a hard stop. Left unchecked, agents can ping-pong feedback indefinitely. The 5-iteration cap is a pragmatic safety net.
Hybrid local + cloud models are the right balance. Not every task needs GPT-4.1. User story generation and test writing work well on smaller local models, cutting costs without sacrificing quality.
"Ask, don't guess" is the single most important principle. When the Spec Agent encounters ambiguous requirements, it stops and asks the user rather than hallucinating assumptions. This one rule prevents the most costly category of errors.

Try It Yourself
The framework is open source and designed to be extensible:

git clone https://github.com/prasunagga/engineeringSquad.git
cd engineeringSquad
pip install -r requirements.txt
# Edit your requirement
notepad requirements_input.txt
# Run (local-only, no Azure needed)
python main.py --local-only

Requirements:
- Python 3.10+
- Windows, macOS, or Linux
- For local-only: Foundry Local (winget install Microsoft.FoundryLocal)
- For cloud mode: Azure OpenAI endpoint + az login

What's Next
- Azure DevOps MCP integration: Auto-sync stories, tasks, and test cases to ADO boards
- CI/CD trigger: Auto-run the squad on PR creation or work item assignment
- Multi-repo support: Frontend, backend, and infra in separate repositories
- Cost estimation: Estimate effort and cloud costs from the technical design

Links
- GitHub: github.com/prasunagga/engineeringSquad
- Foundry Local docs: learn.microsoft.com/en-us/azure/foundry-local/what-is-foundry-local
- LangGraph docs: langchain.com/langgraph
- Azure OpenAI docs: azure.microsoft.com/en-us/products/ai-foundry/models/openai

Building Agentic Systems on Azure: Microsoft Foundry Agents SDK vs Microsoft Agent Framework
In my recent experience as a Senior Consultant at Microsoft, I’ve been actively involved in designing and delivering AI-driven solutions, with a strong focus on building intelligent agents using modern frameworks. Along the way, I've built agents using both Microsoft Foundry Agents SDK (hereafter "Agents SDK") and Microsoft Agent Framework (MAF). Both approaches are powerful and capable. However, once you move beyond simple proofs of concept, the developer experience and architectural patterns start to differ significantly. This article provides a practical comparison based on real implementation experience and aims to help developers choose the right approach.

Approach 1: Agents SDK
Agents SDK provides a straightforward way to create agents with integrated tools and models.

Example: Creating an Agent

import os

from azure.ai.projects import AIProjectClient
from azure.ai.agents.models import AzureAISearchTool, AzureAISearchQueryType
from azure.identity import DefaultAzureCredential

client = AIProjectClient(
    credential=DefaultAzureCredential(),
    endpoint=os.getenv("AZURE_AI_PROJECT_ENDPOINT"),
)

# Configure tools
# conn_id: ID of the Azure AI Search connection in the project (defined elsewhere)
ai_search = AzureAISearchTool(
    index_connection_id=conn_id,
    index_name="my-index",
    query_type=AzureAISearchQueryType.SEMANTIC,
)

# Create agent (persisted in Foundry portal)
agent = client.agents.create_agent(
    model=os.getenv("AZURE_AI_AGENT_DEPLOYMENT_NAME"),
    name="MyAgent",
    instructions="You are a helpful assistant.",
    tool_resources=ai_search.resources,
    tools=ai_search.definitions,
)

# Run conversation
thread = client.agents.threads.create()
client.agents.messages.create(thread_id=thread.id, role="user", content="Hello")
run = client.agents.runs.create(thread_id=thread.id, agent_id=agent.id)

What this approach provides
- Native integration with Azure AI services (OpenAI, AI Search, MCP)
- Managed execution environment
- Simple and quick agent setup

Conceptually, this approach can be summarized as: Model + Tools + Execution

Strengths
✅ Rapid development and onboarding
✅ Strong
integration within the Azure ecosystem ✅ Well-suited for single-agent or tool-driven use cases ✅ Minimal infrastructure overhead Challenges observed in practice As the complexity of scenarios increases, certain limitations become more visible: Multi-agent workflows require custom orchestration logic Agent handoffs must be implemented manually Context sharing across agents requires additional design effort While this approach offers flexibility, it shifts orchestration complexity to the developer. Approach 2: Microsoft Agent Framework (MAF) Microsoft Agent Framework introduces a higher-level abstraction, focused on agent orchestration and system design. Creating an Agent from agent_framework import Agent, WorkflowBuilder, Message from agent_framework.foundry import FoundryChatClient from azure.identity import DefaultAzureCredential client = FoundryChatClient( project_endpoint=os.getenv("FOUNDRY_PROJECT_ENDPOINT"), model=os.getenv("FOUNDRY_MODEL_DEPLOYMENT_NAME"), credential=DefaultAzureCredential(), ) # Create agents (in-process only, not persisted in portal) researcher = Agent(client, name="ResearcherAgent", instructions="Research topics thoroughly.") writer = Agent(client, name="WriterAgent", instructions="Write concise summaries.") # Build and run multi-agent workflow workflow = WorkflowBuilder(start_executor=researcher).add_edge(researcher, writer).build() async for event in workflow.run(Message("user", "Summarize migration best practices"), stream=True): print(event.content) What this approach provides Built-in orchestration capabilities Native support for multi-agent workflows Structured agent lifecycle management Context and memory handling Conceptually, this can be viewed as: Agents + Orchestration + System Design Observations from implementation When implementing similar use cases using MAF: Agent responsibilities became clearly defined Routing and delegation patterns were significantly simplified Overall system architecture became easier to maintain and 
scale This approach encourages thinking in terms of agent ecosystems rather than isolated agents. Architecture Comparison Agents SDK Microsoft Agent Framework (MAF) Choosing the Right Approach Use Agents SDK when: You need rapid development for a single-agent use case The workflow is relatively straightforward You prefer flexibility and lower-level control Use Microsoft Agent Framework when: You are designing multi-agent systems Your solution requires routing, delegation, or handoffs Long-term scalability and maintainability are essential Pros and Cons Summary Agents SDK Pros Easy to get started Strong Azure integration Flexible design Cons Manual orchestration required Limited native multi-agent support Complexity increases as scenarios grow Microsoft Agent Framework (MAF) Pros Built-in orchestration Native multi-agent support Scalable and structured architecture Cons Learning curve for new developers More opinionated framework design Reduced low-level control compared to SDK-based approach References and Repositories 🔗 Microsoft Agent Framework (MAF) Microsoft Agent Framework – GitHub Repository Microsoft Agent Framework Samples – Tutorials & Examples Workflow Samples (Multi-agent patterns) FoundryChatClient sample (Python) Agent Framework demos - GitHub Source 📘 Documentation Microsoft Agent Framework Overview (Microsoft Learn) Agent Framework + Microsoft Foundry provider docs 🔗 Azure AI Projects / Agents SDK Azure AI Projects SDK – Python (GitHub Source) Azure AI Projects Agents (.NET SDK repo) 📘 Documentation Azure AI Projects SDK (Python) – Microsoft Learn Azure AI Agents SDK – Microsoft Learn Conclusion Azure AI Projects and Microsoft Agent Framework both play important roles in the modern agent development landscape. Agents SDK enables quick and flexible agent development Microsoft Agent Framework enables structured, scalable agent systems In practice, the choice depends on whether you are building a single agent feature or a multi-agent system. 
Final Thought
Agents SDK helps you get started quickly. Microsoft Agent Framework helps you scale with confidence.

In a follow-up blog, I’ll dive into how the M365 Agents SDK compares with Microsoft Agent Framework, especially in the context of enterprise productivity and Copilot experiences.

Building an On-Device Voice Assistant with Microsoft Foundry Local
Why on-device voice still matters Most "voice AI" tutorials assume your audio leaves the machine. You ship a WAV to Whisper-API, your transcript to GPT-4, and a synthesized response back over the wire. That works — but it also means three round trips, three per-token bills, and three places your user's voice gets logged. The new wave of small, hardware-optimised models changes the trade-off. NVIDIA's Nemotron Speech Streaming En 0.6B is a 600M-parameter streaming ASR model published into the Microsoft Foundry Local catalog. Paired with a small chat model like qwen2.5-0.5b or phi-4-mini , you can run the entire capture → transcribe → reason → respond loop in-process on a developer laptop, with no API keys and no network egress. This post walks through how the fl-nemotron sample does it, the SDK pitfalls we hit on the way, and the design decisions that made the pipeline reliable. What we're building A browser-hosted assistant served by FastAPI at http://127.0.0.1:8000 . The page captures microphone audio, posts it to /api/transcribe , then streams the chat reply back over Server-Sent Events from /api/chat . All inference runs locally through two Foundry Local models loaded into the same process. The shape of the pipeline: Microphone (browser MediaRecorder) │ WebM/Opus blob ▼ Client-side WAV encoder (16 kHz, mono, PCM-16) │ multipart/form-data ▼ FastAPI /api/transcribe │ ▼ Nemotron Speech Streaming En 0.6B (Foundry Local audio client) │ transcript text ▼ Chat LLM e.g. qwen2.5-0.5b (Foundry Local chat client) │ streamed tokens ▼ FastAPI /api/chat → SSE → browser bubble The version that bit us: foundry-local-sdk >= 1.1.0 Before any code, the single most important fact about this project: The Nemotron Speech Streaming model only appears in the Foundry Local 1.1.x catalog. Older SDKs (0.5.x / 0.6.x) cannot resolve the alias nemotron-speech-streaming-en-0.6b and fail with model not found . 
The module name also changed in 1.1.0 — it is now foundry_local_sdk (with the underscore- sdk suffix), not foundry_local . The pip wheel for foundry-local-core is bundled, so there is no separate MSI / winget install to worry about. Pin it explicitly: pip install --upgrade "foundry-local-sdk>=1.1.0,<2" And verify before anything else: python -c "import importlib.metadata as m; print('sdk', m.version('foundry-local-sdk'))" # expect: sdk 1.1.0 Loading both models from one manager The 1.1.x SDK exposes a single FoundryLocalManager that owns the runtime. Each loaded model gives you back a per-model OpenAI-compatible client — get_chat_client() for text models and get_audio_client() for ASR. There is no need to bring your own openai Python package; the SDK ships its own thin client. The wrapper used in the repo ( src/foundry_client.py ) does this: from foundry_local_sdk import Configuration, FoundryLocalManager FoundryLocalManager.initialize(Configuration(app_name="fl-nemotron")) manager = FoundryLocalManager.instance chat_model = manager.load_model("qwen2.5-0.5b") stt_model = manager.load_model("nemotron-speech-streaming-en-0.6b") chat_client = chat_model.get_chat_client() audio_client = stt_model.get_audio_client() Both models are downloaded on first use into the Foundry Local cache and stay resident for the lifetime of the process. On a laptop with 16 GB RAM, the combined working set sits comfortably under 4 GB. The transcription surprise The first naive approach was the obvious one: with open(wav_path, "rb") as f: result = audio_client.transcribe(file=f, model="nemotron-speech-streaming-en-0.6b") That call fails on Nemotron. The bundled ONNX Runtime GenAI in foundry-local-core does not register the nemotron_speech multi-modal model type that the standard AudioClient.transcribe() path tries to instantiate. The error surfaces as a cryptic model-type registration failure deep inside the native runtime. 
The fix is to use the streaming session API instead: a different native entry point (core_interop.start_audio_stream) that the streaming model does support. The repo isolates this in src/_nemotron_live.py:

import threading
import wave

def transcribe_wav_live(audio_client, wav_path, *, language="en"):
    # Read the WAV format fields and the raw PCM frames.
    with wave.open(str(wav_path), "rb") as w:
        sample_rate = w.getframerate()
        channels = w.getnchannels()
        sample_width = w.getsampwidth()
        pcm = w.readframes(w.getnframes())

    session = audio_client.create_live_transcription_session()
    session.settings.sample_rate = sample_rate
    session.settings.channels = channels
    session.settings.bits_per_sample = sample_width * 8
    session.settings.language = language
    session.start()

    # Feed PCM in ~100 ms chunks from a worker thread, then stop.
    bytes_per_sec = sample_rate * channels * sample_width
    chunk_bytes = max(bytes_per_sec // 10, 1024)

    def _pusher():
        try:
            for offset in range(0, len(pcm), chunk_bytes):
                session.append(pcm[offset:offset + chunk_bytes])
        finally:
            # Always stop, so the reader loop below can terminate.
            session.stop()

    threading.Thread(target=_pusher, daemon=True).start()

    # Drain results on the calling thread while the worker pushes audio.
    parts = []
    for resp in session.get_stream():
        for cp in getattr(resp, "content", []) or []:
            text = getattr(cp, "text", "") or getattr(cp, "transcript", "") or ""
            if text:
                parts.append(text)
    return " ".join(p.strip() for p in parts if p.strip()).strip()

Three things to notice:
- Push from a worker thread, read from the calling thread. session.append() is a blocking write into the native stream and session.get_stream() is a blocking generator. Run one in a worker thread so the other can drain in parallel; otherwise you deadlock the session.
- Chunk to ~100 ms. At 16 kHz mono PCM-16 that is 32,000 bytes per second, so about 3,200 bytes per chunk. Smaller chunks (e.g. 10 ms) spend more time crossing the FFI boundary than transcribing; larger chunks (e.g. 1 s) hold back partial results and hurt perceived latency.
- Always session.stop(). Without it the generator never terminates and the request hangs.

The other transcription surprise: browsers don't send WAV
Inside the browser, MediaRecorder defaults to audio/webm; codecs=opus .
That's great for size but bad for our STT model, which expects a 16-bit mono PCM WAV at a known sample rate. Decoding WebM/Opus server-side would require ffmpeg as a runtime dependency — which is exactly the kind of friction this project exists to remove. The cleaner solution is to encode WAV on the client. AudioContext.decodeAudioData already understands WebM/Opus, so the page can decode the recording, resample to 16 kHz, mix to mono, and emit a PCM-16 WAV blob in 30 lines of JavaScript: // Inside src/static/index.html async function webmToWav(blob) { const ctx = new (window.AudioContext || window.webkitAudioContext)({ sampleRate: 16000 }); const buf = await ctx.decodeAudioData(await blob.arrayBuffer()); // Mix to mono const ch = buf.numberOfChannels; const mono = new Float32Array(buf.length); for (let c = 0; c < ch; c++) { const data = buf.getChannelData(c); for (let i = 0; i < data.length; i++) mono[i] += data[i] / ch; } return encodeWav(mono, 16000); } function encodeWav(samples, sampleRate) { const buffer = new ArrayBuffer(44 + samples.length * 2); const view = new DataView(buffer); // RIFF header writeStr(view, 0, "RIFF"); view.setUint32(4, 36 + samples.length * 2, true); writeStr(view, 8, "WAVE"); // fmt chunk writeStr(view, 12, "fmt "); view.setUint32(16, 16, true); // PCM chunk size view.setUint16(20, 1, true); // PCM format view.setUint16(22, 1, true); // mono view.setUint32(24, sampleRate, true); view.setUint32(28, sampleRate * 2, true); // byte rate view.setUint16(32, 2, true); // block align view.setUint16(34, 16, true); // bits per sample // data chunk writeStr(view, 36, "data"); view.setUint32(40, samples.length * 2, true); // PCM-16 samples let o = 44; for (let i = 0; i < samples.length; i++, o += 2) { const s = Math.max(-1, Math.min(1, samples[i])); view.setInt16(o, s < 0 ? 
s * 0x8000 : s * 0x7FFF, true); } return new Blob([view], { type: "audio/wav" }); } Now the server's /api/transcribe endpoint just writes the bytes to a temp file and hands them to transcribe_wav_live() — no audio decoding libraries on the Python side. Wiring it into FastAPI The server ( src/app.py ) is deliberately small. The notable detail is that the same process holds both Foundry Local model handles for its entire lifetime, so there is no warm-up cost per request: @app.post("/api/transcribe") async def transcribe(audio: UploadFile = File(...)): data = await audio.read() with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f: f.write(data); path = f.name text = _ai_client.transcribe(path) return {"text": text} @app.post("/api/chat") async def chat(req: ChatRequest): if req.stream: return StreamingResponse( _sse(_ai_client.stream_completion(req.messages)), media_type="text/event-stream", ) return {"text": _ai_client.chat_completion(req.messages)} Streaming uses Server-Sent Events because they are trivially supported in both fetch() and the FastAPI runtime, and they don't require a WebSocket upgrade through any proxy a developer might have in front of localhost . What it looks like The repo includes screenshots of the running UI: a welcome screen with both models loaded, a streamed haiku reply, an inline code block with copy-to-clipboard, and the recording state for the microphone. Performance, honestly This is a small-model, CPU-friendly stack. On an Arm64 Surface running the x64 SDK under emulation: First model load (cold cache): tens of seconds — downloads ~600 MB for Nemotron and ~400 MB for qwen2.5-0.5b . Subsequent loads (warm cache): a few seconds per model. End-to-end transcription of a 5-second utterance: well under a second after warm-up. First chat token from qwen2.5-0.5b : typically 200–500 ms; full short reply within 1–2 s. 
On x64 silicon with a recent CPU the numbers improve substantially, and the SDK will pick the best execution provider it finds (CPU / DirectML / CUDA) for each model. Trade-offs to know about Model quality. qwen2.5-0.5b is a 500M-parameter model. It is fast and small enough to ship on a laptop, but it is not GPT-4. Swap in phi-4-mini or mistral-nemo-12b-instruct if you have the RAM and want better reasoning — the wrapper accepts any chat alias in the Foundry Local catalog. STT is English-only here. The current Nemotron streaming model in the catalog is ...-en-0.6b . Multilingual variants are likely to follow. Browser microphone needs a real browser. Headless / automated browsers (Playwright, Puppeteer) deny getUserMedia by default. Open the page in Edge / Chrome / Firefox to grant the permission and capture audio for real. No agent framework yet. This sample is deliberately a single-turn loop over a chat client — there is no tool calling, planning, or multi-agent orchestration. Adding the Microsoft Agent Framework on top would be a natural next step for richer behaviour. Responsible AI considerations Running locally removes the cloud-egress class of privacy concerns, but it does not remove responsibility: Disclose recording. The browser prompts for mic permission; your UI should make it obvious when capture is active. The sample shows a red ⏹ button and a "Recording…" banner for that reason. Don't log raw audio. The sample writes audio to a per-request NamedTemporaryFile and deletes it after transcription. Treat the WAV as sensitive data even when it never leaves the device. Small models hallucinate. A 0.5B chat model is great for snappy local replies, but unsuitable for high-stakes answers. Pair it with retrieval, ground it on your own data, or escalate to a larger model when accuracy matters. Try it Clone github.com/leestott/fl-nemotron. ./setup.ps1 (or ./setup.sh ) to create a virtualenv and install the pinned SDK. 
python scripts/prefetch.py nemotron-speech-streaming-en-0.6b qwen2.5-0.5b to download both models.
.venv\Scripts\uvicorn.exe app:app --app-dir src --port 8000 to start the server.
Open http://127.0.0.1:8000 in a real browser and click the 🎤 button.

Where to go next

Foundry Local documentation: official docs for the runtime, catalog, and SDK.
microsoft/Foundry-Local: upstream samples and issue tracker.
NVIDIA Nemotron model family: background on the speech and language models being published into the catalog.
leestott/fl-nemotron: the full source for this post.

Key takeaways

Pin foundry-local-sdk >= 1.1.0. Earlier SDKs cannot see the Nemotron Speech Streaming model.
Use the LiveAudioTranscriptionSession API for Nemotron, not AudioClient.transcribe().
Encode WAV in the browser. It eliminates a heavy server-side ffmpeg dependency for a few lines of JS.
Push audio chunks on a worker thread and drain the response generator on the main one to avoid deadlocks.
A small Foundry Local chat model plus Nemotron STT gives you a credible local voice loop in a single Python process: no cloud, no keys, no data egress.

Edge AI for Beginners : Getting Started with Foundry Local
In Module 08 of the EdgeAI for Beginners course, Microsoft introduces Foundry Local, a toolkit that helps you deploy and test Small Language Models (SLMs) completely offline. In this blog, I’ll share how I installed Foundry Local, ran the Phi-3.5-mini model on my Windows laptop, and what I learned through the process.

What Is Foundry Local?

Foundry Local allows developers to run AI models locally on their own hardware. It supports text generation, summarization, and code completion, all without sending data to the cloud. Unlike cloud-based systems, everything happens on your computer, so your data never leaves your device.

Prerequisites

Before starting, make sure you have: Windows 10 or 11; Python 3.10 or newer; Git; an internet connection (for the first-time model download); and Foundry Local installed.

Step 1: Verify Installation

After installing Foundry Local, open Command Prompt and type:

foundry --version

If you see a version number, Foundry Local is installed correctly.

Step 2: Start the Service

Start the Foundry Local service using:

foundry service start

You should see a confirmation message that the service is running.

Step 3: List Available Models

To view the models supported by your system, run:

foundry model list

You’ll get a list of locally available SLMs. Here’s what I saw on my machine:

Note: Model availability depends on your device’s hardware. For most laptops, phi-3.5-mini works smoothly on CPU.

Step 4: Run the Phi-3.5 Model

Now let’s start chatting with the model:

foundry model run phi-3.5-mini-instruct-generic-cpu:1

Once it loads, you’ll enter an interactive chat mode. Try a simple prompt:

Hello! What can you do?

The model replies directly from your laptop, with no cloud round trip. To exit, type:

/exit

How It Works

Foundry Local loads the model weights from your device and performs inference locally. This means text generation happens using your CPU (or GPU, if available). The result: complete privacy, no internet dependency once the model is downloaded, and no network latency on responses.
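If you would rather script the steps above than type them each time, the following is a minimal sketch in Python. It only wraps the `foundry` commands already shown in this post; the helper names `foundry_commands` and `run_setup_checks` are hypothetical, and the sketch assumes the `foundry` executable is on your PATH.

```python
import shutil
import subprocess

# Model alias copied from Step 4 of this post.
MODEL_ALIAS = "phi-3.5-mini-instruct-generic-cpu:1"


def foundry_commands(model_alias=MODEL_ALIAS):
    """Return the CLI invocations from Steps 1-4, in order."""
    return [
        ["foundry", "--version"],
        ["foundry", "service", "start"],
        ["foundry", "model", "list"],
        ["foundry", "model", "run", model_alias],
    ]


def run_setup_checks(runner=subprocess.run):
    """Run the non-interactive steps (1-3).

    Returns False without running anything if the CLI is not installed.
    """
    if shutil.which("foundry") is None:
        return False
    for cmd in foundry_commands()[:3]:
        # check=True raises CalledProcessError if a step fails.
        runner(cmd, check=True)
    return True


if __name__ == "__main__":
    if run_setup_checks():
        print("Service is up. Start chatting with:")
        print(" ".join(foundry_commands()[-1]))
    else:
        print("foundry CLI not found on PATH; install Foundry Local first.")
```

The `foundry model run` step is left out of `run_setup_checks` on purpose: it opens an interactive chat, so it is better started by hand in a terminal.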
Benefits for Students For students beginning their journey in AI, Foundry Local offers several key advantages: No need for high-end GPUs or expensive cloud subscriptions. Easy setup for experimenting with multiple models. Perfect for class assignments, AI workshops, and offline learning sessions. Promotes a deeper understanding of model behavior by allowing step-by-step local interaction. These factors make Foundry Local a practical choice for learning environments, especially in universities and research institutions where accessibility and affordability are important. Why Use Foundry Local Running models locally offers several practical benefits compared to using AI Foundry in the cloud. With Foundry Local, you do not need an internet connection, and all computations happen on your personal machine. This makes it faster for small models and more private since your data never leaves your device. In contrast, AI Foundry runs entirely on the cloud, requiring internet access and charging based on usage. For students and developers, Foundry Local is ideal for quick experiments, offline testing, and understanding how models behave in real-time. On the other hand, AI Foundry is better suited for large-scale or production-level scenarios where models need to be deployed at scale. In summary, Foundry Local provides a flexible and affordable environment for hands-on learning, especially when working with smaller models such as Phi-3, Qwen2.5, or TinyLlama. It allows you to experiment freely, learn efficiently, and better understand the fundamentals of Edge AI development. Optional: Restart Later Next time you open your laptop, you don’t have to reinstall anything. 
Just run these two commands again:

foundry service start
foundry model run phi-3.5-mini-instruct-generic-cpu:1

What I Learned

Following the EdgeAI for Beginners Study Guide helped me understand:
How edge AI applications work
How small models like Phi-3.5 can run on a local machine
How to test prompts and build chat apps with zero cloud usage

Conclusion

Running the Phi-3.5-mini model locally with Foundry Local gave me hands-on insight into edge AI. It’s an easy, private, and cost-free way to explore generative AI development. If you’re new to Edge AI, start with the EdgeAI for Beginners course and follow its Study Guide to get comfortable with local inference and small language models.

Resources: EdgeAI for Beginners GitHub Repo, Foundry Local Official Site, Phi Model Link

Building an End-to-End Azure RAG Strategy Agent with MS Foundry
High-Level Architecture This architecture represents an end-to-end Retrieval-Augmented Generation (RAG) pipeline where raw documents are ingested from Azure Blob Storage, processed using Document Intelligence, transformed into embeddings via Azure OpenAI, and indexed in Azure AI Search for hybrid retrieval. A Foundry/MAF-based agent orchestrates query processing by combining user input with relevant search results and generates contextual responses, which are exposed through a FastAPI or CLI interface. This solution is composed of two main layers: 1. Data Ingestion Layer (RAG Pipeline) This layer transforms raw enterprise documents into searchable knowledge. Flow: Raw documents stored in Azure Blob Storage Supported formats: PDF, DOCX, PPTX, images, etc. Document Intelligence extraction Extracts: Text Tables Key-value pairs Structure Writes output as structured JSON back to Blob (processed/) Chunking + Embedding Documents are split into chunks Each chunk is embedded using Azure OpenAI (text-embedding-*) Indexing into Azure AI Search Creates a hybrid index: Keyword search Semantic ranking Vector search Enables flexible retrieval strategies 2. Query Layer (Strategy Agents) This layer enables intelligent query answering. 
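To make the augmentation part of this layer concrete, here is a minimal sketch, assuming the retrieved chunks arrive as plain strings already ranked by the search service. The function name `build_grounded_prompt`, the `max_chunks` parameter, and the prompt wording are hypothetical and are not taken from the repository.

```python
def build_grounded_prompt(question, chunks, max_chunks=3):
    """Combine the user question with the top retrieved chunks.

    `chunks` is assumed to be ordered best-first by the retriever.
    """
    selected = chunks[:max_chunks]
    # Number each source so the model can refer back to it.
    sources = "\n".join(
        f"[{i}] {text.strip()}" for i, text in enumerate(selected, start=1)
    )
    return (
        "Answer using only the sources below. "
        "If they do not contain the answer, say so.\n\n"
        f"Sources:\n{sources}\n\n"
        f"Question: {question.strip()}"
    )


if __name__ == "__main__":
    demo_chunks = [
        "Invoices are stored in the processed/ container as JSON.",
        "The index supports keyword, semantic, and vector queries.",
    ]
    print(build_grounded_prompt("Where are invoices stored?", demo_chunks))
```

Capping the number of chunks keeps the prompt within the model's context budget; the right value depends on chunk size and the model deployment in use.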
Flow:
User sends a query via a FastAPI endpoint or the CLI interface.
Query is handled by a Microsoft Agent Framework (MAF) agent running on Azure AI Foundry.
The agent queries Azure AI Search, retrieves the top relevant chunks, and injects them into the LLM prompt.
The LLM generates a grounded response.

This follows the standard RAG pattern: Retrieval → Augmentation → Generation

End-to-End Flow

Key Azure Services Used

| Service | Purpose |
| --- | --- |
| Azure Blob Storage | Raw + processed document storage |
| Azure AI Document Intelligence | Extract structured content |
| Azure OpenAI | Embeddings + LLM generation |
| Azure AI Search | Hybrid retrieval engine |
| Azure AI Foundry | Agent orchestration |
| Microsoft Agent Framework | Agent execution layer |

Why this Architecture Matters

This solution goes beyond basic RAG and provides:

Hybrid Retrieval: combines keyword + semantic + vector search; improves recall and accuracy.
Structured Document Parsing: handles complex enterprise documents; extracts tables and metadata.
Agent-Based Orchestration: enables reasoning over retrieval results; extensible for multi-agent workflows.
Scalable Data Pipeline: supports continuous ingestion; works with large document collections.

Enterprise Considerations

Use Managed Identity for secure service access
Apply RBAC on Cosmos DB / Search / Storage
Enable Private Endpoints for network isolation
Use Guardrails + Evaluations in Foundry

Summary

This repository demonstrates a production-ready Azure RAG architecture:
Ingest → Extract → Chunk → Embed → Index
Retrieve → Reason → Generate
Powered by Azure AI Foundry + Agent Framework

By combining data engineering + AI orchestration, it enables enterprise AI systems that are accurate, grounded, and extensible.

Repo: https://github.com/snd94/azure-rag-strategy-agent

Please refer to the Microsoft Learn Documentation for further information:
Azure AI Search documentation - Azure AI Search | Microsoft Learn
Document Intelligence documentation - Quickstarts, Tutorials, API Reference - Foundry Tools | Microsoft Learn
How to generate embeddings with
Azure OpenAI in Microsoft Foundry Models - Microsoft Foundry | Microsoft Learn
Microsoft Agent Framework Overview | Microsoft Learn
What is Microsoft Foundry? - Microsoft Foundry | Microsoft Learn

The Cloud Foundation for Safe Agentic AI
Why enterprise agents need more than a working prototype Most AI conversations start with the model. Which model should we use? Which framework? Which agent platform? Which demo can we build quickly enough to make the idea feel real? Those questions are not wrong, but they are rarely the first questions that matter in an enterprise environment. In real projects, the hard part usually appears after the first prototype works. The demo can answer a question, call a tool, retrieve a document, or update a record. Then someone asks whether it can be connected to production data, used by more teams, or allowed to trigger real actions. That is where the conversation changes. In the first part of this series, I looked at why many companies are less ready for agentic AI than they think. The blockers were practical and familiar: unclear business problems, immature processes, weak data foundations, and no clear owner when an AI system makes a poor recommendation or takes a wrong action. The message was simple: Before a company asks what agents can do, it needs to understand what it is ready to delegate. But business readiness is only the first layer. Even when the use case is clear, the process is understood, and leadership is aligned, another question appears. Is the platform ready to support agents safely? This is where Part 2 begins. Agentic AI does not behave like a normal application workload. A traditional application usually follows predefined paths. It receives a request, processes logic, returns a response, writes to a database, or calls an API. Agents introduce a different pattern. They reason over context, retrieve information, choose tools, trigger actions, interact with other services, and sometimes operate across multiple systems at once. That makes the surrounding cloud platform much more important. There is also a shadow AI angle to this. 
In many organizations, agent-like capabilities are already entering through SaaS platforms, vendor copilots, browser extensions, and productivity tools. These systems may not run inside the organization’s governed Azure subscriptions, but they can still interact with enterprise data and business workflows. If the official platform is not ready, teams will often find less governed ways to experiment anyway. That is not always malicious. Sometimes it is just people trying to solve their work with the tools available to them. The marketing analyst pasting customer data into a public chatbot because the official AI platform is six months away. The support team using a browser extension that summarizes tickets, without anyone realizing those tickets are also being sent to a third-party service. From a governance point of view, the effect is the same. Cloud readiness for agentic AI is not defined by access to cloud services or model endpoints alone. The real question is whether the platform can support controlled autonomy. Before enterprises can trust agents to act, the platform must be able to identify them, observe their behavior, restrict their permissions, enforce policy, and contain failure. Without that, an organization is not really deploying an intelligent assistant in a controlled way. It is introducing a workload that can interact with enterprise systems without anyone clearly watching what it does or being able to stop it. From business readiness to cloud readiness After the business foundation is clear, the next layer is the cloud foundation. A company may have a strong use case, executive support, and even a working prototype. But that does not mean it is ready to deploy agents in production. A prototype can run with broad access, manual supervision, loose logging, and a small group of test users. Production requires more discipline. It requires clear identity, controlled access, traceable activity, enforceable policy, and operational ownership. 
Cloud readiness for agentic AI comes down to four pillars, in this order: Identity-first architecture Observability Policy controls Platform constraints The order matters. 1. Identity-first architecture Identity comes first because nothing can be governed properly if it cannot be identified. In traditional cloud systems, we already learned this lesson with users, applications, service principals, managed identities, and workloads. Agents add another layer of non-human actors into the enterprise environment. If an agent can retrieve data, call tools, trigger workflows, or interact with business systems, it needs a clear identity. Without that foundation, governance becomes fragile. Teams may struggle to control what the agent can access, understand what it did, or determine who is accountable when something goes wrong. I have seen agents running in production where nobody could clearly say who owned them. They worked. Until they did not. Identity-first architecture means each agent or agentic workload should have a defined identity, ownership model, permission scope, and lifecycle. It should be clear whether the agent is acting on behalf of a user, acting as a service, or operating within a delegated boundary. This matters because permissions are not an implementation detail. They define the blast radius and accountability model of the system. In Azure environments, this is where Microsoft Entra ID and newer agent identity capabilities become important. As agents become more common across Azure AI Foundry, Copilot Studio, Microsoft 365, and custom frameworks, organizations need a way to understand which agents exist, who owns them, what they can access, and how their lifecycle is managed. Identity is not only about authentication. It is also about visibility, traceability, ownership, permission boundaries, and accountability. Agents should not remain hidden inside application logic or operate through shared identities. 
If they can retrieve data, call tools, or trigger actions, they need to be managed with the same care as any other production workload. 2. Observability Once identity is established, observability becomes the next pillar. Knowing that an agent exists is not enough. The platform must be able to show what the agent did. For normal applications, observability often focuses on service health, latency, failures, and resource usage. For agents, those signals still matter, but they are incomplete. Agent observability also needs to capture the execution path across model calls, retrieved context, orchestration steps, tool calls, policy decisions, approvals, denials, and final actions. This changes how we think about monitoring. With agentic systems, the question is not only whether a request succeeded or failed. Teams also need to understand the path that led to the outcome, the context used, the tools called, the policies applied, and the point where behavior changed. Without that visibility, it is difficult to investigate failures and improve reliability. This is also where observability starts to support governance, not just troubleshooting. Once teams can measure how agents behave, they can move toward KPI-based governance. That may include reliability, escalation rates, policy denials, grounding quality, tool-call failures, cost per interaction, latency, and business outcome metrics. Without this measurement layer, maturity remains mostly opinion-based. With it, governance becomes evidence-based. In Azure, Azure Monitor is the obvious starting point. Together with services such as Application Insights and Log Analytics, it provides the telemetry foundation needed to understand how AI workloads behave in production. For agentic systems, this usually requires combining platform telemetry with application-level traces from orchestration, retrieval, model calls, policy decisions, and tool execution. This visibility is what makes continuous improvement possible. 
It is also what allows governance to mature from “we think the agent is behaving correctly” to “we can measure how the agent behaves over time.” Small difference. Large consequence. 3. Policy controls The third pillar is policy controls. This comes after identity and observability because policy needs both. Identity defines who or what the rule applies to. Observability helps teams understand whether the rule is effective, bypassed, misconfigured, or too restrictive. Policy controls define the boundaries for what agents are allowed to do. They determine how agents access data, which tools they can use, which environments are in scope, when approval is required, and when an action or response should be blocked. The key point is simple: Prompts can guide behavior, but they are not a reliable enforcement layer. For enterprise systems, policy needs to be external, testable, auditable, and enforceable. This becomes especially important because agents may operate across multiple systems. An agent may retrieve information from one source, reason over the result, call a tool, update a ticket, send a message, or trigger a workflow. Each step may appear safe in isolation, while the full chain creates risk. Policy controls provide boundaries around that chain. In Azure, this starts at the cloud governance layer. Azure landing zones, management group structures, and Azure Policy can help define where AI workloads are deployed, how environments are separated, and which rules apply consistently across subscriptions. At runtime, Azure AI Content Safety can help detect harmful content, prompt attacks, unsafe interactions, or outputs that drift away from the intended task. For tool and API access, Azure API Management can also be used as a controlled gateway between agents and downstream systems. This can support centralized authentication, throttling, mediation, logging, and policy enforcement. 
It is not mandatory in every design, but it is a useful option when agents need governed access to APIs instead of direct backend connectivity. The goal is not to create friction for the sake of control. The goal is to make sure the agent operates inside boundaries that are defined outside the prompt and outside the model response. 4. Platform constraints The fourth pillar is platform constraints. This area often receives less attention early in the project, but it strongly shapes whether an agentic system can operate safely and reliably in production. These constraints include network isolation, private connectivity, data residency, regional availability, quota limits, model throughput, latency, logging retention, integration boundaries, cost behavior, and operational ownership. They may seem like implementation details during early design discussions, but they often determine whether the system can actually run in production. For agentic workloads, these constraints also shape where experimentation happens. Sandboxed environments, isolated subscriptions, limited tool access, and controlled test data can help teams evaluate agent behavior before exposing it to production systems. This becomes even more important when agents are allowed to generate code, call external tools, or execute actions that may not be fully trusted at design time. Platform constraints are where the earlier pillars meet implementation reality. Identity affects how agents connect to services. Observability affects logging cost, retention, and investigation capability. Policy affects routing, network design, tool exposure, and user experience. By the time an agentic system reaches production, these constraints are no longer background details. They become design boundaries. In Azure, this is where landing zone design, private networking, regional planning, quota management, cost management, and operational runbooks matter. 
Azure landing zones, private endpoints, private DNS, Azure Firewall, NSGs, and controlled network paths all influence whether the agent architecture can move from prototype to production without being redesigned halfway through. And yes, that redesign usually happens at the least convenient moment. Architecture has a sense of humor. Not a kind one.

From principles to Azure capabilities

The four pillars are not only architectural principles. They need to be translated into platform capabilities, operating practices, and governance controls. In practice, controlled agent deployment is rarely achieved by a single product or service. It requires multiple layers working together. Identity, monitoring, policy, networking, runtime safety, API exposure, and operational controls all play a part.

Azure provides several services and patterns that can help implement these controls, but there is no fixed blueprint that applies to every organization. The right combination depends on the use case, regulatory requirements, existing landing zone design, integration landscape, and the level of autonomy expected from the agent. The examples below should be seen as a practical toolset, not as a mandatory checklist.

| Pillar | Goal | Example Azure capabilities |
| --- | --- | --- |
| Identity-first architecture | Make agents visible, owned, permissioned, and governable as enterprise workloads. | Microsoft Entra ID, Microsoft Entra Agent ID, managed identities, service principals, workload identities, access reviews, Conditional Access, Privileged Identity Management |
| Observability | Understand runtime behavior, trace execution paths, investigate failures, and improve reliability. | Azure Monitor, Application Insights, Log Analytics, Azure AI Foundry tracing, diagnostic settings, distributed tracing, correlation IDs, application-level telemetry |
| Policy controls | Enforce boundaries around access, actions, content safety, APIs, and governance. | Azure landing zones, management groups, Azure Policy, Azure AI Content Safety, Prompt Shields, Microsoft Purview, Azure API Management, RBAC, approval flows |
| Platform constraints | Operate within real cloud boundaries such as networking, region, quota, compliance, and operations. | Azure landing zones, private endpoints, private DNS, private networking, Azure Firewall, NSGs, quota planning, regional architecture, cost management |

The purpose of this mapping is not to suggest that Azure has one single service for each pillar. It does not. The practical goal is to combine the right services and patterns so the platform can identify agents, monitor their behavior, enforce boundaries, and operate within known cloud constraints.

Conclusion

Agentic AI does not become enterprise-ready simply because a model is available, a prototype works, or a business sponsor is excited. The real question is whether the surrounding cloud foundation can support agents that act within boundaries the platform actually enforces. Together, these pillars move the discussion from building an agent to preparing the environment in which the agent can operate responsibly. That distinction is important. A prototype can rely on broad access, limited logging, and close manual supervision. A production system needs clearer boundaries around ownership, access, traceability, and control.

This is also where the series moves naturally into Part 3. Once the business foundation is clear and the cloud foundation is in place, the next challenge is the design of the agent itself. The cloud foundation matters here because it provides the controlled environment in which agents can be tested, limited, and observed before they are trusted with broader enterprise access. For more advanced scenarios, that also includes sandboxing patterns for generated code, tool execution, and untrusted actions. In Part 3, I will move closer to implementation and look at how to design an enterprise-ready agent.
That means defining the agent’s scope, grounding it with reliable knowledge, deciding which tools it can use, designing safe execution loops, adding human oversight where it matters, and thinking carefully about when a single agent is enough versus when multi-agent coordination is justified. That is where agentic AI starts becoming more than an idea. And, as usual, that is also where the architecture starts to matter.

This article is part of my Agentic AI readiness series and was also published on Medium.

Learn how to host your agents on Microsoft Foundry
We just concluded Host your agents on Foundry, a three-part livestream series where we explored how to deploy and host Python AI agents on Microsoft Foundry:

Deploying Python agents to Foundry Hosted agents using the Azure Developer CLI
Building hosted agents with Microsoft Agent Framework, including Foundry IQ integration and multi-agent workflows
Building hosted agents with LangChain + LangGraph, including built-in tools like Bing Web Search
Running quality and safety evaluations: bulk, scheduled, and continuous evals, guardrails, and red-teaming

All of the materials from our series are available for you to keep learning from, and are linked below:

Video recordings of each stream
PowerPoint slides that you can use for reviewing or even teaching the material to your own community
Open-source code samples you can run yourself in your own Microsoft Foundry project

Spanish speaker? Check out the Spanish version of the series.

🙋🏽‍♂️ Have follow-up questions? Join the weekly Python+AI office hours on Foundry Discord.

Host your agents on Foundry: Microsoft Agent Framework

📺 Watch YouTube recording

In our first session, we deploy agents built with Microsoft Agent Framework (the successor to AutoGen and Semantic Kernel). Starting with a simple agent, we add Foundry tools like Code Interpreter, ground the agent in enterprise data with Foundry IQ, and finally deploy multi-agent workflows. Along the way, we use the Foundry UI to interact with the hosted agent, testing it out in the playground and observing the traces from the reasoning and tool calls.

🖼️ Slides for this session
💻 Code repository with examples: foundry-hosted-agentframework-demos
📝 Write-up for this session

Host your agents on Foundry: LangChain + LangGraph

📺 Watch YouTube recording

In our second session, we deploy agents built with the popular open-source libraries LangChain and LangGraph.
Starting with a simple agent, we add Foundry tools like Bing Web Search, ground the agent in Foundry IQ, then deploy more complex agents using the LangGraph orchestration framework. Along the way, we use the Foundry UI to interact with the hosted agent, testing it out in the playground and observing the traces from the reasoning and tool calls. 🖼️ Slides for this session 💻 Code repository with examples: foundry-hosted-langchain-demos 📝 Write-up for this session Host your agents on Foundry: Quality & safety evaluations 📺 Watch YouTube recording In our third session, we ensure that our AI agents are producing high-quality outputs and operating safely and responsibly. First we explore what it means for agent outputs to be high quality, using built-in evaluators to check overall task adherence and then building custom evaluators for domain-specific checks. With Foundry hosted agents, we run bulk evaluations on demand, set up scheduled evaluations, and even enable continuous evaluation on a subset of live agent traces. Next we discuss safety systems that can be layered on top of agents and audit agents for potential safety risks. To improve compliance with an organization's goals, we configure custom policies and guardrails that can be shared across agents. Finally, we ensure that adversarial inputs can't produce unsafe outputs by running automated red-teaming scans on agents, and even schedule those to run regularly as well. 🖼️ Slides for this session 💻 Code repository with examples: foundry-hosted-agentframework-demos 📝 Write-up for this session