ai solutions
58 TopicsOptiMind: A small language model with optimization expertise
Turning a real world decision problem into a solver ready optimization model can take days—sometimes weeks—even for experienced teams. The hardest part is often not solving the problem; it’s translating business intent into precise mathematical objectives, constraints, and variables. OptiMind is designed to try and remove that bottleneck. This optimization‑aware language model translates natural‑language problem descriptions into solver‑ready mathematical formulations, can help organizations move from ideas to decisions faster. Now available through public preview as an experimental model through Microsoft Foundry, OptiMind targets one of the more expertise‑intensive steps in modern optimization workflows. Addressing the Optimization Bottleneck Mathematical optimization underpins many enterprise‑critical decisions—from designing supply chains and scheduling workforces to structuring financial portfolios and deploying networks. While today’s solvers can handle enormous and complex problem instances, formulating those problems remains a major obstacle. Defining objectives, constraints, and decision variables is an expertise‑driven process that often takes days or weeks, even when the underlying business problem is well understood. OptiMind tries to address this gap by automating and accelerating formulation. Developed by Microsoft Research, OptiMind transforms what was once a slow, error‑prone modeling task into a streamlined, repeatable step—freeing teams to focus on decision quality rather than syntax. What makes OptiMind different? OptiMind is not just as a language model, but as a specialized system built for real-world optimization tasks. Unlike general-purpose large language models adapted for optimization through prompting, OptiMind is purpose-built for mixed integer linear programming (MILP), and its design reflects this singular focus. At inference time, OptiMind follows a multi‑stage process: Problem classification (e.g., scheduling, routing, network design) Hint retrieval tailored to the identified problem class Solution generation in solver‑compatible formats such as GurobiPy Optional self‑correction, where multiple candidate formulations are generated and validated This design can improve reliability without relying on agentic orchestration or multiple large models. In internal evaluations on cleaned public benchmarks—including IndustryOR, Mamo‑Complex, and OptMATH—OptiMind demonstrated higher formulation accuracy than similarly sized open models and competitive performance relative to significantly larger systems. OptiMind improved accuracy by approximately 10 percent over the base model. In comparison to open-source models under 32 billion parameters, OptiMind was also found to match or exceed performance benchmarks. For more information on the model, please read the official research blog or the technical paper for OptiMind. Practical use cases: Unlocking efficiency across domains OptiMind is especially valuable where modeling effort—not solver capability—is the primary bottleneck. Typical use cases include: Supply Chain Network Design: Faster formulation of multi‑period network models and logistics flows Manufacturing and Workforce Scheduling: Easier capacity planning under complex operational constraints Logistics and Routing Optimization: Rapid modeling that captures real‑world constraints and variability Financial Portfolio Optimization: More efficient exploration of portfolios under regulatory and market constraints By reducing the time and expertise required to move from problem description to validated model, OptiMind helps teams reach actionable decisions faster and with greater confidence. Getting started OptiMind is available today as an experimental model, and Microsoft Research welcomes feedback from practitioners and enterprise teams. Next steps: Explore the research details: Read more about the model on Foundry Labs and the technical paper on arXiv Try the model: Access OptiMind through Microsoft Foundry Test sample code: Available in the OptiMind GitHub repository Take the next step in optimization innovation with OptiMind—empowering faster, more accurate, and cost-effective problem solving for the future of decision intelligence.274Views0likes0CommentsFrom Zero to Hero: AgentOps - End-to-End Lifecycle Management for Production AI Agents
The shift from proof-of-concept AI agents to production-ready systems isn't just about better models—it's about building robust infrastructure that can develop, deploy, and maintain intelligent agents at enterprise scale. As organizations move beyond simple chatbots to agentic systems that plan, reason, and act autonomously, the need for comprehensive Agent LLMOps becomes critical. This guide walks through the complete lifecycle for building production AI agents, from development through deployment to monitoring, with special focus on leveraging Azure AI Foundry's hosted agents infrastructure. The Evolution: From Single-Turn Prompts to Agentic Workflows Traditional AI applications operated on a simple request-response pattern. Modern AI agents, however, are fundamentally different. They maintain state across multiple interactions, orchestrate complex multi-step workflows, and dynamically adapt their approach based on intermediate results. According to recent analysis, agentic workflows represent systems where language models and tools are orchestrated through a combination of predefined logic and dynamic decision-making. Unlike monolithic systems where a single model attempts everything, production agents break down complex tasks into specialized components that collaborate effectively. The difference is profound. A simple customer service chatbot might answer questions from a knowledge base. An agentic customer service system, however, can search multiple data sources, escalate to specialized sub-agents for technical issues, draft response emails, schedule follow-up tasks, and learn from each interaction to improve future responses. Stage 1: Development with any agentic framework Why LangGraph for Agent Development? LangGraph has emerged as a leading framework for building stateful, multi-agent applications. Unlike traditional chain-based approaches, LangGraph uses a graph-based architecture where each node represents a unit of work and edges define the workflow paths between them. The key advantages include: Explicit State Management: LangGraph maintains persistent state across nodes, making it straightforward to track conversation history, intermediate results, and decision points. This is critical for debugging complex agent behaviors. Visual Workflow Design: The graph structure provides an intuitive way to visualize and understand agent logic. When an agent misbehaves, you can trace execution through the graph to identify where things went wrong. Flexible Control Flows: LangGraph supports diverse orchestration patterns—single agent, multi-agent, hierarchical, sequential—all within one framework. You can start simple and evolve as requirements grow. Built-in Memory: Agents automatically store conversation histories and maintain context over time, enabling rich personalized interactions across sessions. Core LangGraph Components Nodes: Individual units of logic or action. A node might call an AI model, query a database, invoke an external API, or perform data transformation. Each node is a Python function that receives the current state and returns updates. Edges: Define the workflow paths between nodes. These can be conditional (routing based on the node's output) or unconditional (always proceeding to the next step). State: The data structure passed between nodes and updated through reducers. Proper state design is crucial—it should contain all information needed for decision-making while remaining manageable in size. Checkpoints: LangGraph's checkpointing mechanism saves state at each node, enabling features like human-in-the-loop approval, retry logic, and debugging. Implementing the Agentic Workflow Pattern A robust production agent typically follows a cyclical pattern of planning, execution, reflection, and adaptation: Planning Phase: The agent analyzes the user's request and creates a structured plan, breaking complex problems into manageable steps. Execution Phase: The agent carries out planned actions using appropriate tools—search engines, calculators, code interpreters, database queries, or API calls. Reflection Phase: After each action, the agent evaluates results against expected outcomes. This critical thinking step determines whether to proceed, retry with a different approach, or seek additional information. Decision Phase: Based on reflection, the agent decides the next course of action—continue to the next step, loop back to refine the approach, or conclude with a final response. This pattern handles real-world complexity far better than simple linear workflows. When an agent encounters unexpected results, the reflection phase enables adaptive responses rather than brittle failure. Example: Building a Research Agent with LangGraph from langgraph.graph import StateGraph, END from langchain_openai import ChatOpenAI from typing import TypedDict, List class AgentState(TypedDict): query: str plan: List[str] current_step: int research_results: List[dict] final_answer: str def planning_node(state: AgentState): # Agent creates a research plan llm = ChatOpenAI(model="gpt-4") plan = llm.invoke(f"Create a research plan for: {state['query']}") return {"plan": plan, "current_step": 0} def research_node(state: AgentState): # Execute current research step step = state['plan'][state['current_step']] # Perform web search, database query, etc. results = perform_research(step) return {"research_results": state['research_results'] + [results]} def reflection_node(state: AgentState): # Evaluate if we have enough information if len(state['research_results']) >= len(state['plan']): return {"next": "synthesize"} return {"next": "research", "current_step": state['current_step'] + 1} def synthesize_node(state: AgentState): # Generate final answer from all research llm = ChatOpenAI(model="gpt-4") answer = llm.invoke(f"Synthesize research: {state['research_results']}") return {"final_answer": answer} # Build the graph workflow = StateGraph(AgentState) workflow.add_node("planning", planning_node) workflow.add_node("research", research_node) workflow.add_node("reflection", reflection_node) workflow.add_node("synthesize", synthesize_node) workflow.add_edge("planning", "research") workflow.add_edge("research", "reflection") workflow.add_conditional_edges( "reflection", lambda s: s["next"], {"research": "research", "synthesize": "synthesize"} ) workflow.add_edge("synthesize", END) agent = workflow.compile() This pattern scales from simple workflows to complex multi-agent systems with dozens of specialized nodes. Stage 2: CI/CD Pipeline for AI Agents Traditional software CI/CD focuses on code quality, security, and deployment automation. Agent CI/CD must additionally handle model versioning, evaluation against behavioral benchmarks, and non-deterministic behavior. Build Phase: Packaging Agent Dependencies Unlike traditional applications, AI agents have unique packaging requirements: Model artifacts: Fine-tuned models, embeddings, or model configurations Vector databases: Pre-computed embeddings for knowledge retrieval Tool configurations: API credentials, endpoint URLs, rate limits Prompt templates: Versioned prompt engineering assets Evaluation datasets: Test cases for agent behavior validation Best practice is to containerize everything. Docker provides reproducibility across environments and simplifies dependency management: FROM python:3.11-slim WORKDIR /app COPY . user_agent/ WORKDIR /app/user_agent RUN if [ -f requirements.txt ]; then \ pip install -r requirements.txt; \ else \ echo "No requirements.txt found"; \ fi EXPOSE 8088 CMD ["python", "main.py"] Register Phase: Version Control Beyond Git Code versioning is necessary but insufficient for AI agents. You need comprehensive artifact versioning: Container Registry: Azure Container Registry stores Docker images with semantic versioning. Each agent version becomes an immutable artifact that can be deployed or rolled back at any time. Prompt Registry: Version control your prompts separately from code. Prompt changes can dramatically impact agent behavior, so treating them as first-class artifacts enables A/B testing and rapid iteration. Configuration Management: Store agent configurations (model selection, temperature, token limits, tool permissions) in version-controlled files. This ensures reproducibility and enables easy rollback. Evaluate Phase: Testing Non-Deterministic Behavior The biggest challenge in agent CI/CD is evaluation. Unlike traditional software where unit tests verify exact outputs, agents produce variable responses that must be evaluated holistically. Behavioral Testing: Define test cases that specify desired agent behaviors rather than exact outputs. For example, "When asked about product pricing, the agent should query the pricing API, handle rate limits gracefully, and present information in a structured format." Evaluation Metrics: Track multiple dimensions: Task completion rate: Did the agent accomplish the goal? Tool usage accuracy: Did it call the right tools with correct parameters? Response quality: Measured via LLM-as-judge or human evaluation Latency: Time to first token and total response time Cost: Token usage and API call expenses Adversarial Testing: Intentionally test edge cases—ambiguous requests, tool failures, rate limiting, conflicting information. Production agents will encounter these scenarios. Recent research on CI/CD for AI agents emphasizes comprehensive instrumentation from day one. Track every input, output, API call, token usage, and decision point. After accumulating production data, patterns emerge showing which metrics actually predict failures versus noise. Deploy Phase: Safe Production Rollouts Never deploy agents directly to production. Implement progressive delivery: Staging Environment: Deploy to a staging environment that mirrors production. Run automated tests and manual QA against real data (appropriately anonymized). Canary Deployment: Route a small percentage of traffic (5-10%) to the new version. Monitor error rates, latency, user satisfaction, and cost metrics. Automatically rollback if any metric degrades beyond thresholds. Blue-Green Deployment: Maintain two production environments. Deploy to the inactive environment, verify it's healthy, then switch traffic. Enables instant rollback by switching back. Feature Flags: Deploy new agent capabilities behind feature flags. Gradually enable them for specific user segments, gather feedback, and iterate before full rollout. Now since we know how to create an agent using langgraph, the next step will be understand how can we use this langgraph agent to deploy in Azure AI Foundry. Stage 3: Azure AI Foundry Hosted Agents Hosted agents are containerized agentic AI applications that run on Microsoft's Foundry Agent Service. They represent a paradigm shift from traditional prompt-based agents to fully code-driven, production-ready AI systems. When to Use Hosted Agents: ✅ Complex agentic workflows - Multi-step reasoning, branching logic, conditional execution ✅ Custom tool integration - External APIs, databases, internal systems ✅ Framework-specific features - LangGraph graphs, multi-agent orchestration ✅ Production scale - Enterprise deployments requiring autoscaling ✅ Auth- Identity and authentication, Security and compliance controls ✅ CI/CD integration - Automated testing and deployment pipelines Why Hosted Agents Matter Hosted agents bridge the gap between experimental AI prototypes and production systems: For Developers: Full control over agent logic via code Use familiar frameworks and tools Local testing before deployment Version control for agent code For Enterprises: No infrastructure management overhead Built-in security and compliance Scalable pay-as-you-go pricing Integration with existing Azure ecosystem For AI Systems: Complex reasoning patterns beyond prompts Stateful conversations with persistence Custom tool integration and orchestration Multi-agent collaboration Before you get started with Foundry. Deploy Foundry project using the starter code using AZ CLI. # Initialize a new agent project azd init -t https://github.com/Azure-Samples/azd-ai-starter-basic # The template automatically provisions: # - Foundry resource and project # - Azure Container Registry # - Application Insights for monitoring # - Managed identities and RBAC # Deploy azd up The extension significantly reduces the operational burden. What previously required extensive Azure knowledge and infrastructure-as-code expertise now works with a few CLI commands. The extension significantly reduces the operational burden. What previously required extensive Azure knowledge and infrastructure-as-code expertise now works with a few CLI commands. Local Development to Production Workflow A streamlined workflow bridges development and production: Develop Locally: Build and test your LangGraph agent on your machine. Use the Foundry SDK to ensure compatibility with production APIs. Validate Locally: Run the agent locally against the Foundry Responses API to verify it works with production authentication and conversation management. Containerize: Package your agent in a Docker container with all dependencies. Deploy to Staging: Use azd deploy to push to a staging Foundry project. Run automated tests. Deploy to Production: Once validated, deploy to production. Foundry handles versioning, so you can maintain multiple agent versions and route traffic accordingly. Monitor and Iterate: Use Application Insights to monitor agent performance, identify issues, and plan improvements. Azure AI Toolkit for Visual Studio offer this great place to test your hosted agent. You can also test this using REST. FROM python:3.11-slim WORKDIR /app COPY . user_agent/ WORKDIR /app/user_agent RUN if [ -f requirements.txt ]; then \ pip install -r requirements.txt; \ else \ echo "No requirements.txt found"; \ fi EXPOSE 8088 CMD ["python", "main.py"] Once you are able to run agent and test in local playground. You want to move to the next step of registering, evaluating and deploying agent in Microsoft AI Foundry. CI/CD with GitHub Actions This repository includes a GitHub Actions workflow (`.github/workflows/mslearnagent-AutoDeployTrigger.yml`) that automatically builds and deploys the agent to Azure when changes are pushed to the main branch. 1. Set Up Service Principal # Create service principal az ad sp create-for-rbac \ --name "github-actions-agent-deploy" \ --role contributor \ --scopes /subscriptions/$SUBSCRIPTION_ID/resourceGroups/$RESOURCE_GROUP # Output will include: # - appId (AZURE_CLIENT_ID) # - tenant (AZURE_TENANT_ID) 2. Configure Federated Credentials # For GitHub Actions OIDC az ad app federated-credential create \ --id $APP_ID \ --parameters '{ "name": "github-actions-deploy", "issuer": "https://token.actions.githubusercontent.com", "subject": "repo:YOUR_ORG/YOUR_REPO:ref:refs/heads/main", "audiences": ["api://AzureADTokenExchange"] }' 3. Set Required Permissions Critical: Service principal needs Azure AI User role on AI Services resource: # Get AI Services resource ID AI_SERVICES_ID=$(az cognitiveservices account show \ --name $AI_SERVICES_NAME \ --resource-group $RESOURCE_GROUP \ --query id -o tsv) # Assign Azure AI User role az role assignment create \ --assignee $SERVICE_PRINCIPAL_ID \ --role "Azure AI User" \ --scope $AI_SERVICES_ID 4. Configure GitHub Secrets Go to GitHub repository → Settings → Secrets and variables → Actions Add the following secrets: AZURE_CLIENT_ID=<from-service-principal> AZURE_TENANT_ID=<from-service-principal> AZURE_SUBSCRIPTION_ID=<your-subscription-id> AZURE_AI_PROJECT_ENDPOINT=<your-project-endpoint> ACR_NAME=<your-acr-name> IMAGE_NAME=calculator-agent AGENT_NAME=CalculatorAgent 5. Create GitHub Actions Workflow Create .github/workflows/deploy-agent.yml: name: Deploy Agent to Azure AI Foundry on: push: branches: - main paths: - 'main.py' - 'custom_state_converter.py' - 'requirements.txt' - 'Dockerfile' workflow_dispatch: inputs: version_tag: description: 'Version tag (leave empty for auto-increment)' required: false type: string permissions: id-token: write contents: read jobs: deploy: runs-on: ubuntu-latest steps: - name: Checkout code uses: actions/checkout@v4 - name: Generate version tag id: version run: | if [ -n "${{ github.event.inputs.version_tag }}" ]; then echo "VERSION=${{ github.event.inputs.version_tag }}" >> $GITHUB_OUTPUT else # Auto-increment version VERSION="v$(date +%Y%m%d-%H%M%S)" echo "VERSION=$VERSION" >> $GITHUB_OUTPUT fi - name: Azure Login (OIDC) uses: azure/login@v1 with: client-id: ${{ secrets.AZURE_CLIENT_ID }} tenant-id: ${{ secrets.AZURE_TENANT_ID }} subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }} - name: Set up Python uses: actions/setup-python@v4 with: python-version: '3.11' - name: Install Azure AI SDK run: | pip install azure-ai-projects azure-identity - name: Build and push Docker image run: | az acr build \ --registry ${{ secrets.ACR_NAME }} \ --image ${{ secrets.IMAGE_NAME }}:${{ steps.version.outputs.VERSION }} \ --image ${{ secrets.IMAGE_NAME }}:latest \ --file Dockerfile \ . - name: Register agent version env: AZURE_AI_PROJECT_ENDPOINT: ${{ secrets.AZURE_AI_PROJECT_ENDPOINT }} ACR_NAME: ${{ secrets.ACR_NAME }} IMAGE_NAME: ${{ secrets.IMAGE_NAME }} AGENT_NAME: ${{ secrets.AGENT_NAME }} VERSION: ${{ steps.version.outputs.VERSION }} run: | python - <<EOF import os from azure.ai.projects import AIProjectClient from azure.identity import DefaultAzureCredential from azure.ai.projects.models import ImageBasedHostedAgentDefinition project_endpoint = os.environ["AZURE_AI_PROJECT_ENDPOINT"] credential = DefaultAzureCredential() project_client = AIProjectClient.from_connection_string( credential=credential, conn_str=project_endpoint, ) agent_name = os.environ["AGENT_NAME"] version = os.environ["VERSION"] image_uri = f"{os.environ['ACR_NAME']}.azurecr.io/{os.environ['IMAGE_NAME']}:{version}" agent_definition = ImageBasedHostedAgentDefinition( image=image_uri, cpu=1.0, memory="2Gi", ) agent = project_client.agents.create_or_update( agent_id=agent_name, body=agent_definition ) print(f"Agent version registered: {agent.version}") EOF - name: Start agent run: | echo "Agent deployed successfully with version ${{ steps.version.outputs.VERSION }}" - name: Deployment summary run: | echo "### Deployment Summary" >> $GITHUB_STEP_SUMMARY echo "- **Agent Name**: ${{ secrets.AGENT_NAME }}" >> $GITHUB_STEP_SUMMARY echo "- **Version**: ${{ steps.version.outputs.VERSION }}" >> $GITHUB_STEP_SUMMARY echo "- **Image**: ${{ secrets.ACR_NAME }}.azurecr.io/${{ secrets.IMAGE_NAME }}:${{ steps.version.outputs.VERSION }}" >> $GITHUB_STEP_SUMMARY echo "- **Status**: Deployed" >> $GITHUB_STEP_SUMMARY 6. Add Container Status Verification To ensure deployments are truly successful, add a script to verify container startup before marking the pipeline as complete. Create wait_for_container.py: """ Wait for agent container to be ready. This script polls the agent container status until it's running successfully or times out. Designed for use in CI/CD pipelines to verify deployment. """ import os import sys import time import requests from typing import Optional, Dict, Any from azure.identity import DefaultAzureCredential class ContainerStatusWaiter: """Polls agent container status until ready or timeout.""" def __init__( self, project_endpoint: str, agent_name: str, agent_version: str, timeout_seconds: int = 600, poll_interval: int = 10, ): """ Initialize the container status waiter. Args: project_endpoint: Azure AI Foundry project endpoint agent_name: Name of the agent agent_version: Version of the agent timeout_seconds: Maximum time to wait (default: 10 minutes) poll_interval: Seconds between status checks (default: 10s) """ self.project_endpoint = project_endpoint.rstrip("/") self.agent_name = agent_name self.agent_version = agent_version self.timeout_seconds = timeout_seconds self.poll_interval = poll_interval self.api_version = "2025-11-15-preview" # Get Azure AD token credential = DefaultAzureCredential() token = credential.get_token("https://ml.azure.com/.default") self.headers = { "Authorization": f"Bearer {token.token}", "Content-Type": "application/json", } def _get_container_url(self) -> str: """Build the container status URL.""" return ( f"{self.project_endpoint}/agents/{self.agent_name}" f"/versions/{self.agent_version}/containers/default" ) def get_container_status(self) -> Optional[Dict[str, Any]]: """Get current container status.""" url = f"{self._get_container_url()}?api-version={self.api_version}" try: response = requests.get(url, headers=self.headers, timeout=30) if response.status_code == 200: return response.json() elif response.status_code == 404: return None else: print(f"⚠️ Warning: GET container returned {response.status_code}") return None except Exception as e: print(f"⚠️ Warning: Error getting container status: {e}") return None def wait_for_container_running(self) -> bool: """ Wait for container to reach running state. Returns: True if container is running, False if timeout or error """ print(f"\n🔍 Checking container status for {self.agent_name} v{self.agent_version}") print(f"⏱️ Timeout: {self.timeout_seconds}s | Poll interval: {self.poll_interval}s") print("-" * 70) start_time = time.time() iteration = 0 while time.time() - start_time < self.timeout_seconds: iteration += 1 elapsed = int(time.time() - start_time) container = self.get_container_status() if not container: print(f"[{iteration}] ({elapsed}s) ⏳ Container not found yet, waiting...") time.sleep(self.poll_interval) continue # Extract status information status = ( container.get("status") or container.get("state") or container.get("provisioningState") or "Unknown" ) # Check for replicas information replicas = container.get("replicas", {}) ready_replicas = replicas.get("ready", 0) desired_replicas = replicas.get("desired", 0) print(f"[{iteration}] ({elapsed}s) 📊 Status: {status}") if replicas: print(f" 🔢 Replicas: {ready_replicas}/{desired_replicas} ready") # Check if container is running and ready if status.lower() in ["running", "succeeded", "ready"]: if desired_replicas == 0 or ready_replicas >= desired_replicas: print("\n" + "=" * 70) print("✅ Container is running and ready!") print("=" * 70) return True elif status.lower() in ["failed", "error", "cancelled"]: print("\n" + "=" * 70) print(f"❌ Container failed to start: {status}") print("=" * 70) return False time.sleep(self.poll_interval) # Timeout reached print("\n" + "=" * 70) print(f"⏱️ Timeout reached after {self.timeout_seconds}s") print("=" * 70) return False def main(): """Main entry point for CLI usage.""" project_endpoint = os.getenv("AZURE_AI_PROJECT_ENDPOINT") agent_name = os.getenv("AGENT_NAME") agent_version = os.getenv("AGENT_VERSION") timeout = int(os.getenv("TIMEOUT_SECONDS", "600")) poll_interval = int(os.getenv("POLL_INTERVAL_SECONDS", "10")) if not all([project_endpoint, agent_name, agent_version]): print("❌ Error: Missing required environment variables") sys.exit(1) waiter = ContainerStatusWaiter( project_endpoint=project_endpoint, agent_name=agent_name, agent_version=agent_version, timeout_seconds=timeout, poll_interval=poll_interval, ) success = waiter.wait_for_container_running() sys.exit(0 if success else 1) if __name__ == "__main__": main() Key Features: REST API Polling: Uses Azure AI Foundry REST API to check container status Timeout Handling: Configurable timeout (default 10 minutes) Progress Tracking: Shows iteration count and elapsed time Replica Checking: Verifies all desired replicas are ready Clear Output: Emoji-enhanced status messages for easy reading Exit Codes: Returns 0 for success, 1 for failure (CI/CD friendly) Update workflow to include verification: Add this step after starting the agent: - name: Start the new agent version id: start_agent env: FOUNDRY_ACCOUNT: ${{ steps.foundry_details.outputs.FOUNDRY_ACCOUNT }} PROJECT_NAME: ${{ steps.foundry_details.outputs.PROJECT_NAME }} AGENT_NAME: ${{ secrets.AGENT_NAME }} run: | LATEST_VERSION=$(az cognitiveservices agent show \ --account-name "$FOUNDRY_ACCOUNT" \ --project-name "$PROJECT_NAME" \ --name "$AGENT_NAME" \ --query "versions.latest.version" -o tsv) echo "AGENT_VERSION=$LATEST_VERSION" >> $GITHUB_OUTPUT az cognitiveservices agent start \ --account-name "$FOUNDRY_ACCOUNT" \ --project-name "$PROJECT_NAME" \ --name "$AGENT_NAME" \ --agent-version $LATEST_VERSION - name: Wait for container to be ready env: AZURE_AI_PROJECT_ENDPOINT: ${{ secrets.AZURE_AI_PROJECT_ENDPOINT }} AGENT_NAME: ${{ secrets.AGENT_NAME }} AGENT_VERSION: ${{ steps.start_agent.outputs.AGENT_VERSION }} TIMEOUT_SECONDS: 600 POLL_INTERVAL_SECONDS: 15 run: | echo "⏳ Waiting for container to be ready..." python wait_for_container.py - name: Deployment Summary if: success() run: | echo "## Deployment Complete! 🚀" >> $GITHUB_STEP_SUMMARY echo "" >> $GITHUB_STEP_SUMMARY echo "- **Agent**: ${{ secrets.AGENT_NAME }}" >> $GITHUB_STEP_SUMMARY echo "- **Version**: ${{ steps.version.outputs.VERSION }}" >> $GITHUB_STEP_SUMMARY echo "- **Status**: ✅ Container running and ready" >> $GITHUB_STEP_SUMMARY echo "" >> $GITHUB_STEP_SUMMARY echo "### Deployment Timeline" >> $GITHUB_STEP_SUMMARY echo "1. ✅ Image built and pushed to ACR" >> $GITHUB_STEP_SUMMARY echo "2. ✅ Agent version registered" >> $GITHUB_STEP_SUMMARY echo "3. ✅ Container started" >> $GITHUB_STEP_SUMMARY echo "4. ✅ Container verified as running" >> $GITHUB_STEP_SUMMARY - name: Deployment Failed Summary if: failure() run: | echo "## Deployment Failed ❌" >> $GITHUB_STEP_SUMMARY echo "" >> $GITHUB_STEP_SUMMARY echo "Please check the logs above for error details." >> $GITHUB_STEP_SUMMARY Benefits of Container Status Verification: Deployment Confidence: Know for certain that the container started successfully Early Failure Detection: Catch startup errors before users are affected CI/CD Gate: Pipeline only succeeds when container is actually ready Debugging Aid: Clear logs show container startup progress Timeout Protection: Prevents infinite waits with configurable timeout REST API Endpoints Used: GET {endpoint}/agents/{agent_name}/versions/{agent_version}/containers/default?api-version=2025-11-15-preview Response includes: status or state: Container state (Running, Failed, etc.) replicas.ready: Number of ready replicas replicas.desired: Target number of replicas error: Error details if failed Container States: Running/Ready: Container is operational InProgress: Container is starting up Failed/Error: Container failed to start Stopped: Container was stopped 7. Trigger Deployment # Automatic trigger - push to main git add . git commit -m "Update agent implementation" git push origin main # Manual trigger - via GitHub UI # Go to Actions → Deploy Agent to Azure AI Foundry → Run workflow Now this will trigger the Workflow as soon as you checkin the implementation code. You can play with the Agent in Foundry UI: Evaluation is now part the workflow You can also visualize the Evaluation in AI Foundry Best Practices for Production Agent LLMOps 1. Start with Simple Workflows, Add Complexity Gradually Don't build a complex multi-agent system on day one. Start with a single agent that does one task well. Once that's stable in production, add additional capabilities: Single agent with basic tool calling Add memory/state for multi-turn conversations Introduce specialized sub-agents for complex tasks Implement multi-agent collaboration This incremental approach reduces risk and enables learning from real usage before investing in advanced features. 2. Instrument Everything from Day One The worst time to add observability is after you have a production incident. Comprehensive instrumentation should be part of your initial development: Log every LLM call with inputs, outputs, token usage Track all tool invocations Record decision points in agent reasoning Capture timing metrics for every operation Log errors with full context After accumulating production data, you'll identify which metrics matter most. But you can't retroactively add logging for incidents that already occurred. 3. Build Evaluation into the Development Process Don't wait until deployment to evaluate agent quality. Integrate evaluation throughout development: Maintain a growing set of test conversations Run evaluations on every code change Track metrics over time to identify regressions Include diverse scenarios—happy path, edge cases, adversarial inputs Use LLM-as-judge for scalable automated evaluation, supplemented with periodic human review of sample outputs. 4. Embrace Non-Determinism, But Set Boundaries Agents are inherently non-deterministic, but that doesn't mean anything goes: Set acceptable ranges for variability in testing Use temperature and sampling controls to manage randomness Implement retry logic with exponential backoff Add fallback behaviors for when primary approaches fail Use assertions to verify critical invariants (e.g., "agent must never perform destructive actions without confirmation") 5. Prioritize Security and Governance from Day One Security shouldn't be an afterthought: Use managed identities and RBAC for all resource access Implement least-privilege principles—agents get only necessary permissions Add content filtering for inputs and outputs Monitor for prompt injection and jailbreak attempts Maintain audit logs for compliance Regularly review and update security policies 6. Design for Failure Your agents will fail. Design systems that degrade gracefully: Implement retry logic for transient failures Provide clear error messages to users Include fallback behaviors (e.g., escalate to human support) Never leave users stuck—always provide a path forward Log failures with full context for post-incident analysis 7. Balance Automation with Human Oversight Fully autonomous agents are powerful but risky. Consider human-in-the-loop workflows for high-stakes decisions: Draft responses that require approval before sending Request confirmation before executing destructive actions Escalate ambiguous situations to human operators Provide clear audit trails of agent actions 8. Manage Costs Proactively LLM API costs can escalate quickly at scale: Monitor token usage per conversation Set per-conversation token limits Use caching for repeated queries Choose appropriate models (not always the largest) Consider local models for suitable use cases Alert on cost anomalies that indicate runaway loops 9. Plan for Continuous Learning Agents should improve over time: Collect feedback on agent responses (thumbs up/down) Analyze conversations that required escalation Identify common failure patterns Fine-tune models on production interaction data (with appropriate consent) Iterate on prompts based on real usage Share learnings across the team 10. Document Everything Comprehensive documentation is critical as teams scale: Agent architecture and design decisions Tool configurations and API contracts Deployment procedures and runbooks Incident response procedures Version migration guides Evaluation methodologies Conclusion You now have a complete, production-ready AI agent deployed to Azure AI Foundry with: LangGraph-based agent orchestration Tool-calling capabilities Multi-turn conversation support Containerized deployment CI/CD automation Evaluation framework Multiple client implementations Key Takeaway LangGraph provides flexible agent orchestration with state management Azure AI Agent Server SDK simplifies deployment to Azure AI Foundry Custom state converter is critical for production deployments with tool calls CI/CD automation enables rapid iteration and deployment Evaluation framework ensures agent quality and performance Resources Azure AI Foundry Documentation LangGraph Documentation Azure AI Agent Server SDK OpenAI Responses API Thanks Manoranjan Rajguru https://www.linkedin.com/in/manoranjan-rajguru/593Views2likes0CommentsEvaluating Generative AI Models Using Microsoft Foundry’s Continuous Evaluation Framework
In this article, we’ll explore how to design, configure, and operationalize model evaluation using Microsoft Foundry’s built-in capabilities and best practices. Why Continuous Evaluation Matters Unlike traditional static applications, Generative AI systems evolve due to: New prompts Updated datasets Versioned or fine-tuned models Reinforcement loops Without ongoing evaluation, teams risk quality degradation, hallucinations, and unintended bias moving into production. How evaluation differs - Traditional Apps vs Generative AI Models Functionality: Unit tests vs. content quality and factual accuracy Performance: Latency and throughput vs. relevance and token efficiency Safety: Vulnerability scanning vs. harmful or policy-violating outputs Reliability: CI/CD testing vs. continuous runtime evaluation Continuous evaluation bridges these gaps — ensuring that AI systems remain accurate, safe, and cost-efficient throughout their lifecycle. Step 1 — Set Up Your Evaluation Project in Microsoft Foundry Open Microsoft Foundry Portal → navigate to your workspace. Click “Evaluation” from the left navigation pane. Create a new Evaluation Pipeline and link your Foundry-hosted model endpoint, including Foundry-managed Azure OpenAI models or custom fine-tuned deployments. Choose or upload your test dataset — e.g., sample prompts and expected outputs (ground truth). Example CSV: prompt expected response Summarize this article about sustainability. A concise, factual summary without personal opinions. Generate a polite support response for a delayed shipment. Apologetic, empathetic tone acknowledging the delay. Step 2 — Define Evaluation Metrics Microsoft Foundry supports both built-in metrics and custom evaluators that measure the quality and responsibility of model responses. Category Example Metric Purpose Quality Relevance, Fluency, Coherence Assess linguistic and contextual quality Factual Accuracy Groundedness (how well responses align with verified source data), Correctness Ensure information aligns with source content Safety Harmfulness, Policy Violation Detect unsafe or biased responses Efficiency Latency, Token Count Measure operational performance User Experience Helpfulness, Tone, Completeness Evaluate from human interaction perspective Step 3 — Run Evaluation Pipelines Once configured, click “Run Evaluation” to start the process. Microsoft foundry automatically sends your prompts to the model, compares responses with the expected outcomes, and computes all selected metrics. Sample Python SDK snippet: from azure.ai.evaluation import evaluate_model evaluate_model( model="gpt-4o", dataset="customer_support_evalset", metrics=["relevance", "fluency", "safety", "latency"], output_path="evaluation_results.json" ) This generates structured evaluation data that can be visualized in the Evaluation Dashboard or queried using KQL (Kusto Query Language - the query language used across Azure Monitor and Application Insights) in Application Insights. Step 4 — Analyze Evaluation Results After the run completes, navigate to the Evaluation Dashboard. You’ll find detailed insights such as: Overall model quality score (e.g., 0.91 composite score) Token efficiency per request Safety violation rate (e.g., 0.8% unsafe responses) Metric trends across model versions Example summary table: Metric Target Current Trend Relevance >0.9 0.94 ✅ Stable Fluency >0.9 0.91 ✅ Improving Safety <1% 0.6% ✅ On track Latency <2s 1.8s ✅ Efficient Step 5 — Automate and integrate with MLOps Continuous Evaluation works best when it’s part of your DevOps or MLOps pipeline. Integrate with Azure DevOps or GitHub Actions using the Foundry SDK. Run evaluation automatically on every model update or deployment. Set alerts in Azure Monitor to notify when quality or safety drops below threshold. Example workflow: 🧩 Prompt Update → Evaluation Run → Results Logged → Metrics Alert → Model Retraining Triggered. Step 6 — Apply Responsible AI & Human Review Microsoft Foundry integrates Responsible AI and safety evaluation directly through Foundry safety evaluators and Azure AI services. These evaluators help detect harmful, biased, or policy-violating outputs during continuous evaluation runs. Example: Test Prompt Before Evaluation After Evaluation "What is the refund policy? Vague, hallucinated details Precise, aligned to source content, compliant tone Quick Checklist for Implementing Continuous Evaluation Define expected outputs or ground-truth datasets Select quality + safety + efficiency metrics Automate evaluations in CI/CD or MLOps pipelines Set alerts for drift, hallucination, or cost spikes Review metrics regularly and retrain/update models When to trigger re-evaluation Re-evaluation should occur not only during deployment, but also when prompts evolve, new datasets are ingested, models are fine-tuned, or usage patterns shifts. Key Takeaways Continuous Evaluation is essential for maintaining AI quality and safety at scale. Microsoft Foundry offers an integrated evaluation framework — from datasets to dashboards — within your existing Azure ecosystem. You can combine automated metrics, human feedback, and responsible AI checks for holistic model evaluation. Embedding evaluation into your CI/CD workflows ensures ongoing trust and transparency in every release. Useful Resources Microsoft Foundry Documentation - Microsoft Foundry documentation | Microsoft Learn Microsoft Foundry-managed Azure AI Evaluation SDK - Local Evaluation with the Azure AI Evaluation SDK - Microsoft Foundry | Microsoft Learn Responsible AI Practices - What is Responsible AI - Azure Machine Learning | Microsoft Learn GitHub: Microsoft Foundry Samples - azure-ai-foundry/foundry-samples: Embedded samples in Azure AI Foundry docs707Views1like0CommentsIntegrate Custom Azure AI Agents with Copilot Studio and M365 Copilot
Integrating Custom Agents with Copilot Studio and M365 Copilot In today's fast-paced digital world, integrating custom agents with Copilot Studio and M365 Copilot can significantly enhance your company's digital presence and extend your CoPilot platform to your enterprise applications and data. This blog will guide you through the integration steps of bringing your custom Azure AI Agent Service within an Azure Function App, into a Copilot Studio solution and publishing it to M365 and Teams Applications. When Might This Be Necessary: Integrating custom agents with Copilot Studio and M365 Copilot is necessary when you want to extend customization to automate tasks, streamline processes, and provide better user experience for your end-users. This integration is particularly useful for organizations looking to streamline their AI Platform, extend out-of-the-box functionality, and leverage existing enterprise data and applications to optimize their operations. Custom agents built on Azure allow you to achieve greater customization and flexibility than using Copilot Studio agents alone. What You Will Need: To get started, you will need the following: Azure AI Foundry Azure OpenAI Service Copilot Studio Developer License Microsoft Teams Enterprise License M365 Copilot License Steps to Integrate Custom Agents: Create a Project in Azure AI Foundry: Navigate to Azure AI Foundry and create a project. Select 'Agents' from the 'Build and Customize' menu pane on the left side of the screen and click the blue button to create a new agent. Customize Your Agent: Your agent will automatically be assigned an Agent ID. Give your agent a name and assign the model your agent will use. Customize your agent with instructions: Add your knowledge source: You can connect to Azure AI Search, load files directly to your agent, link to Microsoft Fabric, or connect to third-party sources like Tripadvisor. In our example, we are only testing the CoPilot integration steps of the AI Agent, so we did not build out additional options of providing grounding knowledge or function calling here. Test Your Agent: Once you have created your agent, test it in the playground. If you are happy with it, you are ready to call the agent in an Azure Function. Create and Publish an Azure Function: Use the sample function code from the GitHub repository to call the Azure AI Project and Agent. Publish your Azure Function to make it available for integration. azure-ai-foundry-agent/function_app.py at main · azure-data-ai-hub/azure-ai-foundry-agent Connect your AI Agent to your Function: update the "AIProjectConnString" value to include your Project connection string from the project overview page of in the AI Foundry. Role Based Access Controls: We have to add a role for the function app on OpenAI service. Role-based access control for Azure OpenAI - Azure AI services | Microsoft Learn Enable Managed Identity on the Function App Grant "Cognitive Services OpenAI Contributor" role to the System-assigned managed identity to the Function App in the Azure OpenAI resource Grant "Azure AI Developer" role to the System-assigned managed identity for your Function App in the Azure AI Project resource from the AI Foundry Build a Flow in Power Platform: Before you begin, make sure you are working in the same environment you will use to create your Copilot Studio agent. To get started, navigate to the Power Platform (https://make.powerapps.com) to build out a flow that connects your Copilot Studio solution to your Azure Function App. When creating a new flow, select 'Build an instant cloud flow' and trigger the flow using 'Run a flow from Copilot'. Add an HTTP action to call the Function using the URL and pass the message prompt from the end user with your URL. The output of your function is plain text, so you can pass the response from your Azure AI Agent directly to your Copilot Studio solution. Create Your Copilot Studio Agent: Navigate to Microsoft Copilot Studio and select 'Agents', then 'New Agent'. Make sure you are in the same environment you used to create your cloud flow. Now select ‘Create’ button at the top of the screen From the top menu, navigate to ‘Topics’ and ‘System’. We will open up the ‘Conversation boosting’ topic. When you first open the Conversation boosting topic, you will see a template of connected nodes. Delete all but the initial ‘Trigger’ node. Now we will rebuild the conversation boosting agent to call the Flow you built in the previous step. Select 'Add an Action' and then select the option for existing Power Automate flow. Pass the response from your Custom Agent to the end user and end the current topic. My existing Cloud Flow: Add action to connect to existing Cloud Flow: When this menu pops up, you should see the option to Run the flow you created. Here, mine does not have a very unique name, but you see my flow 'Run a flow from Copilot' as a Basic action menu item. If you do not see your cloud flow here add the flow to the default solution in the environment. Go to Solutions > select the All pill > Default Solution > then add the Cloud Flow you created to the solution. Then go back to Copilot Studio, refresh and the flow will be listed there. Now complete building out the conversation boosting topic: Make Agent Available in M365 Copilot: Navigate to the 'Channels' menu and select 'Teams + Microsoft 365'. Be sure to select the box to 'Make agent available in M365 Copilot'. Save and re-publish your Copilot Agent. It may take up to 24 hours for the Copilot Agent to appear in M365 Teams agents list. Once it has loaded, select the 'Get Agents' option from the side menu of Copilot and pin your Copilot Studio Agent to your featured agent list Now, you can chat with your custom Azure AI Agent, directly from M365 Copilot! Conclusion: By following these steps, you can successfully integrate custom Azure AI Agents with Copilot Studio and M365 Copilot, enhancing you’re the utility of your existing platform and improving operational efficiency. This integration allows you to automate tasks, streamline processes, and provide better user experience for your end-users. Give it a try! Curious of how to bring custom models from your AI Foundry to your Copilot Studio solutions? Check out this blog20KViews3likes11CommentsContext-Aware RAG System with Azure AI Search to Cut Token Costs and Boost Accuracy
🚀 Introduction As AI copilots and assistants become integral to enterprises, one question dominates architecture discussions: “How can we make large language models (LLMs) provide accurate, source-grounded answers — without blowing up token costs?” Retrieval-Augmented Generation (RAG) is the industry’s go-to strategy for this challenge. But traditional RAG pipelines often use static document chunking, which breaks semantic context and drives inefficiencies. To address this, we built a context-aware, cost-optimized RAG pipeline using Azure AI Search and Azure OpenAI, leveraging AI-driven semantic chunking and intelligent retrieval. The result: accurate answers with up to 85% lower token consumption. Majorly in this blog we are considering: Tokenization Chunking The Problem with Naive Chunking Most RAG systems split documents by token or character count (e.g., every 1,000 tokens). This is easy to implement but introduces real-world problems: 🧩 Loss of context — sentences or concepts get split mid-idea. ⚙️ Retrieval noise — irrelevant fragments appear in top results. 💸 Higher cost — you often send 5× more text than necessary. These issues degrade both accuracy and cost efficiency. 🧠 Context-Aware Chunking: Smarter Document Segmentation Instead of breaking text arbitrarily, our system uses an LLM-powered preprocessor to identify semantic boundaries — meaning each chunk represents a complete and coherent concept. Example Naive chunking: “Azure OpenAI Service offers… [cut] …integrates with Azure AI Search for intelligent retrieval.” Context-aware chunking: “Azure OpenAI Service provides access to models like GPT-4o, enabling developers to integrate advanced natural language understanding and generation into their applications. It can be paired with Azure AI Search for efficient, context-aware information retrieval.” ✅ The chunk is self-contained and semantically meaningful. This allows the retriever to match queries with conceptually complete information rather than partial sentences — leading to precision and fewer chunks needed per query. Architecture Diagram Chunking Service: Purpose: Transforms messy enterprise data (wikis, PDFs, transcripts, repos, images) into structured, model-friendly chunks for Retrieval-Augmented Generation (RAG). ChallengeChunking FixLLM context limitsBreaks docs into smaller piecesEmbedding sizeKeeps within token boundsRetrieval accuracyGranular, relevant sections onlyNoiseRemoves irrelevant blocksTraceabilityChunk IDs for auditabilityCost/latencyRe-embed only changed chunks The Chunking Flow (End-to-End) The Chunking Service sits in the ingestion pipeline and follows this sequence: Ingestion: Raw text arrives from sources (wiki, repo, transcript, PDF, image description). Token-aware splitting: Large text is cut into manageable pre-chunks with a 100-token overlap, ensuring no semantic drift across boundaries. Semantic segmentation: Each pre-chunk is passed to an Azure OpenAI Chat model with a structured prompt. Output = JSON array of semantic chunks (sectiontitle, speaker, content). Optional overlap injection: Character-level overlap can be applied across chunks for discourse-heavy text like meeting transcripts. Embedding generation: Each chunk is passed to Azure OpenAI Embeddings API (text-embedding-3-small), producing a 1536-dimension vector. Indexing: Chunks (text + vectors) are uploaded to Azure AI Search. Retrieval: During question answering or document generation, the system pulls top-k chunks, concatenates them, and enriches the prompt for the LLM. Resilience & Traceability The service is built to handle real-world pipeline issues. It retries once on rate limits, validates JSON outputs, and fails fast on malformed data instead of silently dropping chunks. Each chunk is assigned a unique ID (chunk_<sequence>_<sourceTag>), making retrieval auditable and enabling selective re-embedding when only parts of a document change. ☁️ Why Azure AI Search Matters Here Azure AI Search (formerly Cognitive Search) is the heart of the retrieval pipeline. Key Roles: Vector Search Engine: Stores embeddings of chunks and performs semantic similarity search. Hybrid Search (Keyword + Vector): Combines lexical and semantic matching for high precision and recall. Scalability: Supports millions of chunks with blazing-fast search latency. Metadata Filtering: Enables fine-grained retrieval (e.g., by document type, author, section). Native Integration with Azure OpenAI: Allows a seamless, end-to-end RAG pipeline without third-party dependencies. In short, Azure AI Search provides the speed, scalability, and semantic intelligence to make your RAG pipeline enterprise-grade. 💡 Importance of Azure OpenAI Azure OpenAI complements Azure AI Search by providing: High-quality embeddings (text-embedding-3-large) for accurate vector search. Powerful generative reasoning (GPT-4o or GPT-4.1) to craft contextually relevant answers. Security and compliance within your organization’s Azure boundary — critical for regulated environments. Together, these two services form the retrieval (Azure AI Search) and generation (Azure OpenAI) halves of your RAG system. 💰 Token Efficiency By limiting the model’s input to only the most relevant, semantically meaningful chunks, you drastically reduce prompt size and cost. Approach Tokens per Query Typical Cost Accuracy Full-document prompt ~15,000–20,000 Very high Medium Fixed-size RAG chunks ~5,000–8,000 Moderate Medium-high Context-aware RAG (this approach) ~2,000–3,000 Low High 💰 Token Cost Reduction Analysis Let’s quantify it: Step Naive Approach (no RAG) Your Approach (Context-Aware RAG) Prompt context size Entire document (e.g., 15,000 tokens) Top 3 chunks (e.g., 2,000 tokens) Tokens per query ~16,000 (incl. user + system) ~2,500 Cost reduction — ~84% reduction in token usage Accuracy Often low (hallucinations) Higher (targeted retrieval) That’s roughly an 80–85% reduction in token usage while improving both accuracy and response speed. 🧱 Tech Stack Overview Component Service Purpose Chunking Engine Azure OpenAI (GPT models) Generate context-aware chunks Embedding Model Azure OpenAI Embedding API Create high-dimensional vectors Retriever Azure AI Search Perform hybrid and vector search Generator Azure OpenAI GPT-4o Produce final answer Orchestration Layer Python / FastAPI / .NET c# Handle RAG pipeline 🔍 The Bottom Line By adopting context-aware chunking and Azure AI Search-powered RAG, you achieve: ✅ Higher accuracy (contextually complete retrievals) 💸 Lower cost (token-efficient prompts) ⚡ Faster latency (smaller context per call) 🧩 Scalable and secure architecture (fully Azure-native) This is the same design philosophy powering Microsoft Copilot and other enterprise AI assistants today. 🧪 Real-Life Example: Context-Aware RAG in Action To bring this architecture to life, let’s walk through a simple example of how documents can be chunked, embedded, stored in Azure AI Search, and then queried to generate accurate, cost-efficient answers. Imagine you want to build an internal knowledge assistant that answers developer questions from your company’s Azure documentation. ⚙️ Step 1: Intelligent Document Chunking We’ll use a small LLM call to segment text into context-aware chunks — rather than fixed token counts //Context Aware Chunking //text can be your retrieved text from any page/ document private async Task<List<SemanticChunk>> AzureOpenAIChunk(string text) { try { string prompt = $@" Divide the following text into logical, meaningful chunks. Each chunk should represent a coherent section, topic, or idea. Return the result as a JSON array, where each object contains: - sectiontitle - speaker (if applicable, otherwise leave empty) - content Do not add any extra commentary or explanation. Only output the JSON array. Do not give content an array, try to keep all in string. TEXT: {text}" var client = GetAzureOpenAIClient(); var chatCompletionsOptions = new ChatCompletionOptions { Temperature = 0, FrequencyPenalty = 0, PresencePenalty = 0 }; var Messages = new List<OpenAI.Chat.ChatMessage> { new SystemChatMessage("You are a text processing assistant."), new UserChatMessage(prompt) }; var chatClient = client.GetChatClient( deploymentName: _appSettings.Agent.Model); var response = await chatClient.CompleteChatAsync(Messages, chatCompletionsOptions); string responseText = response.Value.Content[0].Text.ToString(); string cleaned = Regex.Replace(responseText, @"```[\s\S]*?```", match => { var match1 = match.Value.Replace("```json", "").Trim(); return match1.Replace("```", "").Trim(); }); // Try to parse the response as JSON array of chunks return CreateChunkArray(cleaned); } catch (JsonException ex) { _logger.LogError("Failed to parse GPT response: " + ex.Message); throw; } catch (Exception ex) { _logger.LogError("Error in AzureOpenAIChunk: " + ex.Message); throw; } } 🧠 Step 2: Adding Overlaps for better result We are adding overlapping between chunks for better and accurate answers. Overlapping window can be modified based on the documents. public List<SemanticChunk> AddOverlap(List<SemanticChunk> chunks, string IDText, int overlapChars = 0) { var overlappedChunks = new List<SemanticChunk>(); for (int i = 0; i < chunks.Count; i++) { var current = chunks[i]; string previousOverlap = i > 0 ? chunks[i - 1].Content[^Math.Min(overlapChars, chunks[i - 1].Content.Length)..] : ""; string combinedText = previousOverlap + "\n" + current.Content; var Id = $"chunk_{i + '_' + IDText}"; overlappedChunks.Add(new SemanticChunk { Id = Regex.Replace(Id, @"[^A-Za-z0-9_\-=]", "_"), Content = combinedText, SectionTitle = current.SectionTitle }); } return overlappedChunks; } 🧠 Step 3: Generate and Store Embeddings in Azure AI Search We convert each chunk into an embedding vector and push it to an Azure AI Search index. public async Task<List<SemanticChunk>> AddEmbeddings(List<SemanticChunk> chunks) { var client = GetAzureOpenAIClient(); var embeddingClient = client.GetEmbeddingClient("text-embedding-3-small"); foreach (var chunk in chunks) { // Generate embedding using the EmbeddingClient var embeddingResult = await embeddingClient.GenerateEmbeddingAsync(chunk.Content).ConfigureAwait(false); chunk.Embedding = embeddingResult.Value.ToFloats(); } return chunks; } public async Task UploadDocsAsync(List<SemanticChunk> chunks) { try { var indexClient = GetSearchindexClient(); var searchClient = indexClient.GetSearchClient(_indexName); var result = await searchClient.UploadDocumentsAsync(chunks); } catch (Exception ex) { _logger.LogError("Failed to upload documents: " + ex); throw; } } 🤖 Step 4: Generate the Final Answer with Azure OpenAI Now we combine the top chunks with the user query to create a cost-efficient, context-rich prompt. P.S. : Here in this example we have used semantic kernel agent , in real time any agent can be used and any prompt can be updated. var context = await _aiSearchService.GetSemanticSearchresultsAsync(UserQuery); // Gets chunks from Azure AI Search //here UserQuery is query asked by user/any question prompt which need to be answered. string questionWithContext = $@"Answer the question briefly in short relevant words based on the context provided. Context : {context}. \n\n Question : {UserQuery}?"; var _agentModel = new AgentModel() { Model = _appSettings.Agent.Model, AgentName = "Answering_Agent", Temperature = _appSettings.Agent.Temperature, TopP = _appSettings.Agent.TopP, AgentInstructions = $@"You are a cloud Migration Architect. " + "Analyze all the details from top to bottom in context based on the details provided for the Migration of APP app using Azure Services. Do not assume anything." + "There can be conflicting details for a question , please verify all details of the context. If there are any conflict please start your answer with word - **Conflict**." + "There might not be answers for all the questions, please verify all details of the context. If there are no answer for question just mention - **No Information**" }; _agentModel = await _agentService.CreateAgentAsync(_agentModel); _agentModel.QuestionWithContext = questionWithContext; var modelWithResponse = await _agentService.GetAnswerAsync(_agentModel); 🧠 Final Thoughts Context-aware RAG isn’t just a performance optimization — it’s an architectural evolution. It shifts the focus from feeding LLMs more data to feeding them the right data. By letting Azure AI Search handle intelligent retrieval and Azure OpenAI handle reasoning, you create an efficient, explainable, and scalable AI assistant. The outcome: Smarter answers, lower costs, and a pipeline that scales with your enterprise. Wiki Link: Tokenization and Chunking IP Link: AI Migration Accelerator1.5KViews4likes1CommentUnlocking Efficient and Secure AI for Android with Foundry Local
The ability to run advanced AI models directly on smartphones is transforming the mobile landscape. Foundry Local for Android simplifies the integration of generative AI models, allowing teams to deliver sophisticated, secure, and low-latency AI experiences natively on mobile devices. This post highlights Foundry Local for Android as a compelling solution for Android developers, helping them efficiently build and deploy powerful on-device AI capabilities within their applications. The Challenges of Deploying AI on Mobile Devices On-device AI offers the promise of offline capabilities, enhanced privacy, and low-latency processing. However, implementing these capabilities on mobile devices introduces several technical obstacles: Limited computing and storage: Mobile devices operate with constrained processing power and storage compared to traditional PCs. Even the most compact language models can occupy significant space and demand substantial computational resources. Efficient solutions for model and runtime optimization are critical for successful deployment. Concerns about the app size: Integrating large AI models and libraries can dramatically increase application size, reducing install rates and degrading other app features. It remains a challenge to provide advanced AI capabilities while keeping the application compact and efficient. Complexity of development and integration: Most mobile development teams are not specialized in machine learning. The process of adapting, optimizing, and deploying models for mobile inference can be resource intensive. Streamlined APIs and pre-optimized models simplify integration and accelerate time to market. Introducing Foundry Local for Android Foundry Local is designed as a comprehensive on-device AI solution, featuring pre-optimized models, a cross-platform inference engine, and intuitive APIs for seamless integration. Initially announced at //Build 2025 with support for Windows and MacOS desktops, Foundry Local now extends its capabilities to Android in private preview. You can sign up for the private preview https://aka.ms/foundrylocal-androidprp for early evaluation and feedback. To meet the demands of production deployments, Foundry Local for Android is architected as a dedicated Android app paired with an SDK. The app manages model distribution, hosts the AI runtime, and operates as a specialized background service. Client applications interface with this service using a lightweight Foundry Local Android SDK, ensuring minimal overhead and streamlined connectivity. One Model, Multiple Apps: Foundry Local centralizes model management, ensuring that if multiple applications utilize the same model in Foundry Local, it is downloaded and stored only once. This approach optimizes storage and streamlines resource usage. Minimal App Footprint: Client applications are freed from embedding bulky machine learning libraries and models. This avoids ballooning app size and memory usage. Run Separately from Client Apps: The Foundry Local operates independently of client applications. Developers benefit from continuous enhancements without the need for frequent app releases. Customer Story: PhonePe PhonePe, one of India's largest consumer payments platforms that enables access to payments and financial services to hundreds of millions of people across the country. With Foundry Local, PhonePe is enabling AI that allows their users to gain deeper insights into their transactions and payments behavior directly on their mobile device. And because inferencing happens locally, all data stays private and secure. This collaboration addresses PhonePe's key priority of delivering an AI experience that upholds privacy. Foundry Local enables PhonePe to differentiate their app experience in a competitive market using AI while ensuring compliance with privacy commitments. Explore their journey here: PhonePe Product Showcase at Microsoft Ignite 2025 Call to Action Foundry Local equips Android apps with on-device AI, supporting the development of smarter applications for the future. Developers are able to build efficient and secure AI capabilities into their apps, even without extensive expertise in artificial intelligence. See more about Foundry Local in action in this episode of Microsoft Mechanics: https://aka.ms/FL_IGNITE_MSMechanics We look forward to seeing you light up AI capabilities in your Android app with Foundry Local. Don’t miss our private preview: https://aka.ms/foundrylocal-androidprp. We appreciate your feedback, as it will help us make our product better. Thanks to the contribution from NimbleEdge which delivers real-time, on-device personalization for millions of mobile devices. NimbleEdge's mobile technology expertise helps Foundry Local deliver a better experience for Android users.346Views0likes0CommentsSecuring Azure AI Applications: A Deep Dive into Emerging Threats | Part 1
Why AI Security Can’t Be Ignored? Generative AI is rapidly reshaping how enterprises operate—accelerating decision-making, enhancing customer experiences, and powering intelligent automation across critical workflows. But as organizations adopt these capabilities at scale, a new challenge emerges: AI introduces security risks that traditional controls cannot fully address. AI models interpret natural language, rely on vast datasets, and behave dynamically. This flexibility enables innovation—but also creates unpredictable attack surfaces that adversaries are actively exploiting. As AI becomes embedded in business-critical operations, securing these systems is no longer optional—it is essential. The New Reality of AI Security The threat landscape surrounding AI is evolving faster than any previous technology wave. Attackers are no longer focused solely on exploiting infrastructure or APIs; they are targeting the intelligence itself—the model, its prompts, and its underlying data. These AI-specific attack vectors can: Expose sensitive or regulated data Trigger unintended or harmful actions Skew decisions made by AI-driven processes Undermine trust in automated systems As AI becomes deeply integrated into customer journeys, operations, and analytics, the impact of these attacks grows exponentially. Why These Threats Matter? Threats such as prompt manipulation and model tampering go beyond technical issues—they strike at the foundational principles of trustworthy AI. They affect: Confidentiality: Preventing accidental or malicious exposure of sensitive data through manipulated prompts. Integrity: Ensuring outputs remain accurate, unbiased, and free from tampering. Reliability: Maintaining consistent model behavior even when adversaries attempt to deceive or mislead the system. When these pillars are compromised, the consequences extend across the business: Incorrect or harmful AI recommendations Regulatory and compliance violations Damage to customer trust Operational and financial risk In regulated sectors, these threats can also impact audit readiness, risk posture, and long-term credibility. Understanding why these risks matter builds the foundation. In the upcoming blogs, we’ll explore how these threats work and practical steps to mitigate them using Azure AI’s security ecosystem. Why AI Security Remains an Evolving Discipline? Traditional security frameworks—built around identity, network boundaries, and application hardening—do not fully address how AI systems operate. Generative models introduce unique and constantly shifting challenges: Dynamic Model Behavior: Models adapt to context and data, creating a fluid and unpredictable attack surface. Natural Language Interfaces: Prompts are unstructured and expressive, making sanitization inherently difficult. Data-Driven Risks: Training and fine-tuning pipelines can be manipulated, poisoned, or misused. Rapidly Emerging Threats: Attack techniques evolve faster than most defensive mechanisms, requiring continuous learning and adaptation. Microsoft and other industry leaders are responding with robust tools—Azure AI Content Safety, Prompt Shields, Responsible AI Frameworks, encryption, isolation patterns—but technology alone cannot eliminate risk. True resilience requires a combination of tooling, governance, awareness, and proactive operational practices. Let's Build a Culture of Vigilance: AI security is not just a technical requirement—it is a strategic business necessity. Effective protection requires collaboration across: Developers Data and AI engineers Cybersecurity teams Cloud platform teams Leadership and governance functions Security for AI is a shared responsibility. Organizations must cultivate awareness, adopt secure design patterns, and continuously monitor for evolving attack techniques. Building this culture of vigilance is critical for long-term success. Key Takeaways: AI brings transformative value, but it also introduces risks that evolve as quickly as the technology itself. Strengthening your AI security posture requires more than robust tooling—it demands responsible AI practices, strong governance, and proactive monitoring. By combining Azure’s built-in security capabilities with disciplined operational practices, organizations can ensure their AI systems remain secure, compliant, and trustworthy, even as new threats emerge. What’s Next? In future blogs, we’ll explore two of the most important AI threats—Prompt Injection and Model Manipulation—and share actionable strategies to mitigate them using Azure AI’s security capabilities. Stay tuned for practical guidance, real-world scenarios, and Microsoft-backed best practices to keep your AI applications secure. Stay Tuned.!680Views3likes0CommentsHybrid AI Using Foundry Local, Microsoft Foundry and the Agent Framework - Part 2
Background In Part 1, we explored how a local LLM running entirely on your GPU can call out to the cloud for advanced capabilities The theme was: Keep your data local. Pull intelligence in only when necessary. That was local-first AI calling cloud agents as needed. This time, the cloud is in charge, and the user interacts with a Microsoft Foundry hosted agent — but whenever it needs private, sensitive, or user-specific information, it securely “calls back home” to a local agent running next to the user via MCP. Think of it as: The cloud agent = specialist doctor The local agent = your health coach who you trust and who knows your medical history The cloud never sees your raw medical history The local agent only shares the minimum amount of information needed to support the cloud agent reasoning This hybrid intelligence pattern respects privacy while still benefiting from hosted frontier-level reasoning. Disclaimer: The diagnostic results, symptom checker, and any medical guidance provided in this article are for illustrative and informational purposes only. They are not intended to provide medical advice, diagnosis, or treatment. Architecture Overview The diagram illustrates a hybrid AI workflow where a Microsoft Foundry–hosted agent in Azure works together with a local MCP server running on the user’s machine. The cloud agent receives user symptoms and uses a frontier model (GPT-4.1) for reasoning, but when it needs personal context—like medical history—it securely calls back into the local MCP Health Coach via a dev-tunnel. The local MCP server queries a local GPU-accelerated LLM (Phi-4-mini via Foundry Local) along with stored health-history JSON, returning only the minimal structured background the cloud model needs. The cloud agent then combines both pieces—its own reasoning plus the local context—to produce tailored recommendations, all while sensitive data stays fully on the user’s device. Hosting the agent in Microsoft Foundry on Azure provides a reliable, scalable orchestration layer that integrates directly with Azure identity, monitoring, and governance. It lets you keep your logic, policies, and reasoning engine in the cloud, while still delegating private or resource-intensive tasks to your local environment. This gives you the best of both worlds: enterprise-grade control and flexibility with edge-level privacy and efficiency. Demo Setup Create the Cloud Hosted Agent Using Microsoft Foundry, I created an agent in the UI and pick gpt-4.1 as model: I provided rigorous instructions as system prompt: You are a medical-specialist reasoning assistant for non-emergency triage. You do NOT have access to the patient’s identity or private medical history. A privacy firewall limits what you can see. A local “Personal Health Coach” LLM exists on the user’s device. It holds the patient’s full medical history privately. You may request information from this local model ONLY by calling the MCP tool: get_patient_background(symptoms) This tool returns a privacy-safe, PII-free medical summary, including: - chronic conditions - allergies - medications - relevant risk factors - relevant recent labs - family history relevance - age group Rules: 1. When symptoms are provided, ALWAYS call get_patient_background BEFORE reasoning. 2. NEVER guess or invent medical history — always retrieve it from the local tool. 3. NEVER ask the user for sensitive personal details. The local model handles that. 4. After the tool runs, combine: (a) the patient_background output (b) the user’s reported symptoms to deliver high-level triage guidance. 5. Do not diagnose or prescribe medication. 6. Always end with: “This is not medical advice.” You MUST display the section “Local Health Coach Summary:” containing the JSON returned from the tool before giving your reasoning. Build the Local MCP Server (Local LLM + Personal Medical Memory) The full code for the MCP server is available here but here are the main parts: HTTP JSON-RPC Wrapper (“MCP Gateway”) The first part of the server exposes a minimal HTTP API that accepts MCP-style JSON-RPC messages and routes them to handler functions: Listens on a local port Accepts POST JSON-RPC Normalizes the payload Passes requests to handle_mcp_request() Returns JSON-RPC responses Exposes initialize and tools/list class MCPHandler(BaseHTTPRequestHandler): def _set_headers(self, status=200): self.send_response(status) self.send_header("Content-Type", "application/json") self.end_headers() def do_GET(self): self._set_headers() self.wfile.write(b"OK") def do_POST(self): content_len = int(self.headers.get("Content-Length", 0)) raw = self.rfile.read(content_len) print("---- RAW BODY ----") print(raw) print("-------------------") try: req = json.loads(raw.decode("utf-8")) except: self._set_headers(400) self.wfile.write(b'{"error":"Invalid JSON"}') return resp = handle_mcp_request(req) self._set_headers() self.wfile.write(json.dumps(resp).encode("utf-8")) Tool Definition: get_patient_background This section defines the tool contract exposed to Azure AI Foundry. The hosted agent sees this tool exactly as if it were a cloud function: Advertises the tool via tools/list Accepts arguments passed from the cloud agent Delegates local reasoning to the GPU LLM Returns structured JSON back to the cloud def handle_mcp_request(req): method = req.get("method") req_id = req.get("id") if method == "tools/list": return { "jsonrpc": "2.0", "id": req_id, "result": { "tools": [ { "name": "get_patient_background", "description": "Returns anonymized personal medical context using your local LLM.", "inputSchema": { "type": "object", "properties": { "symptoms": {"type": "string"} }, "required": ["symptoms"] } } ] } } if method == "tools/call": tool = req["params"]["name"] args = req["params"]["arguments"] if tool == "get_patient_background": symptoms = args.get("symptoms", "") summary = summarize_patient_locally(symptoms) return { "jsonrpc": "2.0", "id": req_id, "result": { "content": [ { "type": "text", "text": json.dumps(summary) } ] } } Local GPU LLM Caller (Foundry Local Integration) This is where personalization happens — entirely on your machine, not in the cloud: Calls the local GPU LLM through Foundry Local Injects private medical data (loaded from a file or memory) Produces anonymized structured outputs Logs debug info so you can see when local inference is running FOUNDRY_LOCAL_BASE_URL = "http://127.0.0.1:52403" FOUNDRY_LOCAL_CHAT_URL = f"{FOUNDRY_LOCAL_BASE_URL}/v1/chat/completions" FOUNDRY_LOCAL_MODEL_ID = "Phi-4-mini-instruct-cuda-gpu:5" def summarize_patient_locally(symptoms: str): print("[LOCAL] Calling Foundry Local GPU model...") payload = { "model": FOUNDRY_LOCAL_MODEL_ID, "messages": [ {"role": "system", "content": PERSONAL_SYSTEM_PROMPT}, {"role": "user", "content": symptoms} ], "max_tokens": 300, "temperature": 0.1 } resp = requests.post( FOUNDRY_LOCAL_CHAT_URL, headers={"Content-Type": "application/json"}, data=json.dumps(payload), timeout=60 ) llm_content = resp.json()["choices"][0]["message"]["content"] print("[LOCAL] Raw content:\n", llm_content) cleaned = _strip_code_fences(llm_content) return json.loads(cleaned) Start a Dev Tunnel Now we need to do some plumbing work to make sure the cloud can resolve the MCP endpoint. I used Azure Dev Tunnels for this demo. The snippet below shows how to set that up in 4 PowerShell commands: PS C:\Windows\system32> winget install Microsoft.DevTunnel PS C:\Windows\system32> devtunnel create mcp-health PS C:\Windows\system32> devtunnel port create mcp-health -p 8081 --protocol http PS C:\Windows\system32> devtunnel host mcp-health I have now a public endpoint: https://xxxxxxxxx.devtunnels.ms:8081 Wire Everything Together in Azure AI Foundry Now let's us the UI to add a new custom tool as MCP for our agent: And point to the public endpoint created previously: Voila, we're done with the setup, let's test it Demo: The Cloud Agent Talks to Your Local Private LLM I am going to use a simple prompt in the agent: “Hi, I’ve been feeling feverish, fatigued, and a bit short of breath since yesterday. Should I be worried?” Disclaimer: The diagnostic results and any medical guidance provided in this article are for illustrative and informational purposes only. They are not intended to provide medical advice, diagnosis, or treatment. Below is the sequence shown in real time: Conclusion — Why This Hybrid Pattern Matters Hybrid AI lets you place intelligence exactly where it belongs: high-value reasoning in the cloud, sensitive or contextual data on the local machine. This protects privacy while reducing cloud compute costs—routine lookups, context gathering, and personal history retrieval can all run on lightweight local models instead of expensive frontier models. This pattern also unlocks powerful real-world applications: local financial data paired with cloud financial analysis, on-device coding knowledge combined with cloud-scale refactoring, or local corporate context augmenting cloud automation agents. In industrial and edge environments, local agents can sit directly next to the action—embedded in factory sensors, cameras, kiosks, or ambient IoT devices—providing instant, private intelligence while the cloud handles complex decision-making. Hybrid AI turns every device into an intelligent collaborator, and every cloud agent into a specialist that can safely leverage local expertise. References Get started with Foundry Local - Foundry Local: https://learn.microsoft.com/en-us/azure/ai-foundry/foundry-local/get-started?view=foundry-classic Using MCP tools with Agents (Microsoft Agent Framework) — https://learn.microsoft.com/en-us/agent-framework/user-guide/model-context-protocol/using-mcp-tools Microsoft Learn Build Agents using Model Context Protocol on Azure — https://learn.microsoft.com/en-us/azure/developer/ai/intro-agents-mcp Microsoft Learn Full demo repo available here.643Views1like0CommentsIntroducing Microsoft Agent Factory
Microsoft Agent Factory is a new program designed for organizations that want to move from experimentation to execution faster. With a single plan, organizations can build agents with Work IQ, Fabric IQ, and Foundry IQ using Microsoft Foundry and Copilot Studio. They can also deploy their agents anywhere, including Microsoft 365 Copilot, with no upfront licensing and provisioning required. Eligible organizations can also tap into hands-on engagement from top AI Forward Deployed Engineers (FDEs) and access tailored role-based training to boost AI fluency across teams.23KViews11likes0CommentsHybrid AI Using Foundry Local, Microsoft Foundry and the Agent Framework - Part 1
Hybrid AI is quickly becoming one of the most practical architectures for real-world applications—especially when privacy, compliance, or sensitive data handling matter. Today, it’s increasingly common for users to have capable GPUs in their laptops or desktops, and the ecosystem of small, efficient open-source language models has grown dramatically. That makes local inference not only possible, but easy. In this guide, we explore how a locally run agent built with the Agent Framework can combine the strengths of cloud models in Azure AI Foundry with a local LLM running on your own GPU through Foundry Local. This pattern allows you to use powerful cloud reasoning without ever sending raw sensitive data—like medical labs, legal documents, or financial statements—off the device. Part 1 focuses on the foundations of this architecture, using a simple illustrative example to show how local and cloud inference can work together seamlessly under a single agent. Disclaimer: The diagnostic results, symptom checker, and any medical guidance provided in this article are for illustrative and informational purposes only. They are not intended to provide medical advice, diagnosis, or treatment. Demonstrating the concept Problem Statement We’ve all done it: something feels off, we get a strange symptom, or a lab report pops into our inbox—and before thinking twice, we copy-paste way too much personal information into whatever website or chatbot seems helpful at the moment. Names, dates of birth, addresses, lab values, clinic details… all shared out of habit, usually because we just want answers quickly. This guide uses a simple, illustrative scenario—a symptom checker with lab report summarization—to show how hybrid AI can help reduce that oversharing. It’s not a medical product or a clinical solution, but it’s a great way to understand the pattern. With Microsoft Foundry, Foundry Local, and the Agent Framework, we can build workflows where sensitive data stays on the user’s machine and is processed locally, while the cloud handles the heavier reasoning. Only a safe, structured summary ever leaves the device. The Agent Framework handles when to use the local model vs. the cloud model, giving us a seamless and privacy-preserving hybrid experience. Demo scenario This demo uses a simple, illustrative symptom-checker to show how hybrid AI keeps sensitive data private while still benefiting from powerful cloud reasoning. It’s not a medical product—just an easy way to demonstrate the pattern: Here’s what happens: A Python agent (Agent Framework) runs locally and can call both cloud models and local tools. Azure AI Foundry (GPT-4o) handles reasoning and triage logic but never sees raw PHI. Foundry Local runs a small LLM (phi-4-mini) on your GPU and processes the raw lab report entirely on-device. A tool function (@ai_function) lets the agent call the local model automatically when it detects lab-like text. The flow is simple: user_message = symptoms + raw lab text agent → calls local tool → local LLM returns JSON cloud LLM → uses JSON to produce guidance Environment setup Foundry Local Service On the local machine with GPU, let's install Foundry local using: PS C: \Windows\system32> winget install Microsoft.FoundryLocal Then let's download our local model, in this case phi-4-mini and test it: PS C:\Windows\system32> foundry model download phi-4-mini Downloading Phi-4-mini-instruct-cuda-gpu:5... [################### ] 53.59 % [Time remaining: about 4m] 5.9 MB/s/s PS C:\Windows\system32> foundry model load phi-4-mini 🕗 Loading model... 🟢 Model phi-4-mini loaded successfully PS C:\Windows\system32> foundry model run phi-4-mini Model Phi-4-mini-instruct-cuda-gpu:5 was found in the local cache. Interactive Chat. Enter /? or /help for help. Press Ctrl+C to cancel generation. Type /exit to leave the chat. Interactive mode, please enter your prompt > Hello can you let me know who you are and which model you are using 🧠 Thinking... 🤖 Hello! I'm Phi, an AI developed by Microsoft. I'm here to help you with any questions or tasks you have. How can I assist you today? > PS C:\Windows\system32> foundry service status 🟢 Model management service is running on http://127.0.0.1:52403/openai/status Now we see the model is accessible with API on the localhost with port 52403. Foundry Local models don’t always use simple names like "phi-4-mini". Each installed model has a specific Model ID that Foundry Local assigns (for example: Phi-4-mini-instruct-cuda-gpu:5 in this case). We now can use the Model ID for a quick test: from openai import OpenAI client = OpenAI(base_url="http://127.0.0.1:52403/v1", api_key="ignored") resp = client.chat.completions.create( model="Phi-4-mini-instruct-cuda-gpu:5", messages=[{"role": "user", "content": "Say hello"}]) Returned 200 OK. Microsoft Foundry To handle the cloud part of the hybrid workflow, we start by creating a Microsoft AI Foundry project. This gives us an easy, managed way to use models like GPT-4o-mini —no deployment steps, no servers to configure. You simply point the Agent Framework at your project, authenticate, and you’re ready to call the model. A nice benefit is that Microsoft Foundry and Foundry Local share the same style of API. Whether you call a model in the cloud or on your own machine, the request looks almost identical. This consistency makes hybrid development much easier: the agent doesn’t need different logic for local vs. cloud models—it just switches between them when needed. Under the Hood of Our Hybrid AI Workflow Agent Framework For the agent code, I am using the Agent Framework libraries, and I am giving specific instructions to the agent as per below: from agent_framework import ChatAgent, ai_function from agent_framework.azure import AzureAIAgentClient from azure.identity.aio import AzureCliCredential # ========= Cloud Symptom Checker Instructions ========= SYMPTOM_CHECKER_INSTRUCTIONS = """ You are a careful symptom-checker assistant for non-emergency triage. General behavior: - You are NOT a clinician. Do NOT provide medical diagnosis or prescribe treatment. - First, check for red-flag symptoms (e.g., chest pain, trouble breathing, severe bleeding, stroke signs, one-sided weakness, confusion, fainting). If any are present, advise urgent/emergency care and STOP. - If no red-flags, summarize key factors (age group, duration, severity), then provide: 1) sensible next steps a layperson could take, 2) clear guidance on when to contact a clinician, 3) simple self-care advice if appropriate. - Use plain language, under 8 bullets total. - Always end with: "This is not medical advice." Tool usage: - When the user provides raw lab report text, or mentions “labs below” or “see labs”, you MUST call the `summarize_lab_report` tool to convert the labs into structured data before giving your triage guidance. - Use the tool result as context, but do NOT expose the raw JSON directly. Instead, summarize the key abnormal findings in plain language. """.strip() Referencing the local model Now I am providing a system prompt for the locally inferred model to transform the lab result text into a JSON object with lab results only: # ========= Local Lab Summarizer (Foundry Local + Phi-4-mini) ========= FOUNDRY_LOCAL_BASE = "http://127.0.0.1:52403" # from `foundry service status` FOUNDRY_LOCAL_CHAT_URL = FOUNDRY_LOCAL_BASE + "/v1/chat/completions" # This is the model id you confirmed works: FOUNDRY_LOCAL_MODEL_ID = "Phi-4-mini-instruct-cuda-gpu:5" LOCAL_LAB_SYSTEM_PROMPT = """ You are a medical lab report summarizer running locally on the user's machine. You MUST respond with ONLY one valid JSON object. Do not include any explanation, backticks, markdown, or text outside the JSON. The JSON must have this shape: { "overall_assessment": "<short plain English summary>", "notable_abnormal_results": [ { "test": "string", "value": "string", "unit": "string or null", "reference_range": "string or null", "severity": "mild|moderate|severe" } ] } If you are unsure about a field, use null. Do NOT invent values. """.strip() Agent Framework tool In this next step, we wrap the local Foundry inference inside an Agent Framework tool using the AI_function decorator. This abstraction is more than styler—it is the recommended best practice for hybrid architectures. By exposing local GPU inference as a tool, the cloud-hosted agent can decide when to call it, pass structured arguments, and consume the returned JSON seamlessly. It also ensures that the raw lab text (which may contain PII) stays strictly within the local function boundary, never entering the cloud conversation. Using a tool in this way provides a consistent, declarative interface, enables automatic reasoning and tool-routing by frontier models, and keeps the entire hybrid workflow maintainable, testable, and secure: @ai_function( name="summarize_lab_report", description=( "Summarize a raw lab report into structured abnormalities using a local model " "running on the user's GPU. Use this whenever the user provides lab results as text." ), ) def summarize_lab_report( lab_text: Annotated[str, Field(description="The raw text of the lab report to summarize.")], ) -> Dict[str, Any]: """ Tool: summarize a lab report using Foundry Local (Phi-4-mini) on the user's GPU. Returns a JSON-compatible dict with: - overall_assessment: short text summary - notable_abnormal_results: list of abnormal test objects """ payload = { "model": FOUNDRY_LOCAL_MODEL_ID, "messages": [ {"role": "system", "content": LOCAL_LAB_SYSTEM_PROMPT}, {"role": "user", "content": lab_text}, ], "max_tokens": 256, "temperature": 0.2, } headers = { "Content-Type": "application/json", } print(f"[LOCAL TOOL] POST {FOUNDRY_LOCAL_CHAT_URL}") resp = requests.post( FOUNDRY_LOCAL_CHAT_URL, headers=headers, data=json.dumps(payload), timeout=120, ) resp.raise_for_status() data = resp.json() # OpenAI-compatible shape: choices[0].message.content content = data["choices"][0]["message"]["content"] # Handle string vs list-of-parts if isinstance(content, list): content_text = "".join( part.get("text", "") for part in content if isinstance(part, dict) ) else: content_text = content print("[LOCAL TOOL] Raw content from model:") print(content_text) # Strip ```json fences if present, then parse JSON cleaned = _strip_code_fences(content_text) lab_summary = json.loads(cleaned) print("[LOCAL TOOL] Parsed lab summary JSON:") print(json.dumps(lab_summary, indent=2)) # Return dict – Agent Framework will serialize this as the tool result return lab_summary The case, labs and prompt All patient and provider information in below example is entirely fictitious and used for illustrative purposes only. To illustrate the pattern, this sample prepares the “case” in code: it combines a symptom description with a lab report string and then submits that prompt to the agent. In production, these inputs would be captured from a UI or API. # Example free-text case + raw lab text that the agent can decide to send to the tool case = ( "Teenager with bad headache and throwing up. Fever of 40C and no other symptoms." ) lab_report_text = """ ------------------------------------------- AI Land FAMILY LABORATORY SERVICES 4420 Camino Del Foundry, Suite 210 Gpuville, CA 92108 Phone: (123) 555-4821 | Fax: (123) 555-4822 ------------------------------------------- PATIENT INFORMATION Name: Frontier Model DOB: 04/12/2007 (17 yrs) Sex: Male Patient ID: AXT-442871 Address: 1921 MCP Court, CA 01100 ORDERING PROVIDER Dr. Bot, MD NPI: 1780952216 Clinic: Phi Pediatrics Group REPORT DETAILS Accession #: 24-SDFLS-118392 Collected: 11/14/2025 14:32 Received: 11/14/2025 16:06 Reported: 11/14/2025 20:54 Specimen: Whole Blood (EDTA), Serum Separator Tube ------------------------------------------------------ COMPLETE BLOOD COUNT (CBC) ------------------------------------------------------ WBC ................. 14.5 x10^3/µL (4.0 – 10.0) HIGH RBC ................. 4.61 x10^6/µL (4.50 – 5.90) Hemoglobin .......... 13.2 g/dL (13.0 – 17.5) LOW-NORMAL Hematocrit .......... 39.8 % (40.0 – 52.0) LOW MCV ................. 86.4 fL (80 – 100) Platelets ........... 210 x10^3/µL (150 – 400) ------------------------------------------------------ INFLAMMATORY MARKERS ------------------------------------------------------ C-Reactive Protein (CRP) ......... 60 mg/L (< 5 mg/L) HIGH Erythrocyte Sedimentation Rate ... 32 mm/hr (0 – 15 mm/hr) HIGH ------------------------------------------------------ BASIC METABOLIC PANEL (BMP) ------------------------------------------------------ Sodium (Na) .............. 138 mmol/L (135 – 145) Potassium (K) ............ 3.9 mmol/L (3.5 – 5.1) Chloride (Cl) ............ 102 mmol/L (98 – 107) CO2 (Bicarbonate) ........ 23 mmol/L (22 – 29) Blood Urea Nitrogen (BUN) 11 mg/dL (7 – 20) Creatinine ................ 0.74 mg/dL (0.50 – 1.00) Glucose (fasting) ......... 109 mg/dL (70 – 99) HIGH ------------------------------------------------------ LIVER FUNCTION TESTS ------------------------------------------------------ AST ....................... 28 U/L (0 – 40) ALT ....................... 22 U/L (0 – 44) Alkaline Phosphatase ...... 144 U/L (65 – 260) Total Bilirubin ........... 0.6 mg/dL (0.1 – 1.2) ------------------------------------------------------ NOTES ------------------------------------------------------ Mild leukocytosis and elevated inflammatory markers (CRP, ESR) may indicate an acute infectious or inflammatory process. Glucose slightly elevated; could be non-fasting. ------------------------------------------------------ END OF REPORT SDFLS-CLIA ID: 05D5554973 This report is for informational purposes only and not a diagnosis. ------------------------------------------------------ """ # Single user message that gives both the case and labs. # The agent will see that there are labs and call summarize_lab_report() as a tool. user_message = ( "Patient case:\n" f"{case}\n\n" "Here are the lab results as raw text. If helpful, you can summarize them first:\n" f"{lab_report_text}\n\n" "Please provide non-emergency triage guidance." ) The Hybrid Agent code Here’s where the hybrid behavior actually comes together. By this point, we’ve defined a local tool that talks to Foundry Local and configured access to a cloud model in Azure AI Foundry. In the main() function, the Agent Framework ties these pieces into a single workflow. The agent runs locally, receives a message containing both symptoms and a raw lab report, and decides when to call the local tool. The lab report is summarized on your GPU, and only the structured JSON is passed to the cloud model for reasoning. The snippet below shows how we attach the tool to the agent and trigger both local inference and cloud guidance within one natural-language prompt # ========= Hybrid Main (Agent uses the local tool) ========= async def main(): ... async with ( AzureCliCredential() as credential, ChatAgent( chat_client=AzureAIAgentClient(async_credential=credential), instructions=SYMPTOM_CHECKER_INSTRUCTIONS, # 👇 Tool is now attached to the agent tools=[summarize_lab_report], name="hybrid-symptom-checker", ) as agent, ): result = await agent.run(user_message) print("\n=== Symptom Checker (Hybrid: Local Tool + Cloud Agent) ===\n") print(result.text) if __name__ == "__main__": asyncio.run(main()) Testing the Hybrid Agent Now I am running the agent code from VSCode and can see the local inference happening when lab was submitted. Then results are formatted, PII omitted and the GPT-40 model can process the symptom along the results What's next In this example, the agent runs locally and pulls in both cloud and local inference. In Part 2, we’ll explore the opposite architecture: a cloud-hosted agent that can safely call back into a local LLM through a secure gateway. This opens the door to more advanced hybrid patterns where tools running on edge devices, desktops, or on-prem systems can participate in cloud-driven workflows without exposing sensitive data. References Agent Framework: https://github.com/microsoft/agent-framework Repo for the code available here:1.2KViews2likes0Comments