
Microsoft Foundry Blog
7 MIN READ

Hybrid AI Using Foundry Local, Microsoft Foundry and the Agent Framework - Part 2

OlivierB123
Nov 21, 2025

Building a Cloud Agent That Thinks Locally: A Hybrid AI Pattern for Privacy-Preserving Reasoning

Background

In Part 1, we explored how a local LLM running entirely on your GPU can call out to the cloud for advanced capabilities. The theme was: keep your data local and pull intelligence in only when necessary. That was local-first AI calling cloud agents as needed.

This time, the cloud is in charge: the user interacts with a Microsoft Foundry-hosted agent, but whenever that agent needs private, sensitive, or user-specific information, it securely “calls back home” via MCP to a local agent running next to the user.

Think of it as:

  • The cloud agent = specialist doctor
  • The local agent = your health coach who you trust and who knows your medical history
  • The cloud never sees your raw medical history
  • The local agent only shares the minimum amount of information needed to support the cloud agent's reasoning

This hybrid intelligence pattern respects privacy while still benefiting from hosted frontier-level reasoning.

Disclaimer:
The diagnostic results, symptom checker, and any medical guidance provided in this article are for illustrative and informational purposes only. They are not intended to provide medical advice, diagnosis, or treatment.

Architecture Overview

The diagram illustrates a hybrid AI workflow where a Microsoft Foundry–hosted agent in Azure works together with a local MCP server running on the user’s machine. The cloud agent receives user symptoms and uses a frontier model (GPT-4.1) for reasoning, but when it needs personal context—like medical history—it securely calls back into the local MCP Health Coach via a dev-tunnel. The local MCP server queries a local GPU-accelerated LLM (Phi-4-mini via Foundry Local) along with stored health-history JSON, returning only the minimal structured background the cloud model needs. The cloud agent then combines both pieces—its own reasoning plus the local context—to produce tailored recommendations, all while sensitive data stays fully on the user’s device.

Hosting the agent in Microsoft Foundry on Azure provides a reliable, scalable orchestration layer that integrates directly with Azure identity, monitoring, and governance. It lets you keep your logic, policies, and reasoning engine in the cloud, while still delegating private or resource-intensive tasks to your local environment. This gives you the best of both worlds: enterprise-grade control and flexibility with edge-level privacy and efficiency.

Demo Setup

Create the Cloud Hosted Agent

Using Microsoft Foundry, I created an agent in the UI and picked gpt-4.1 as the model:

I provided rigorous instructions as the system prompt:

You are a medical-specialist reasoning assistant for non-emergency triage.  
You do NOT have access to the patient’s identity or private medical history.  
A privacy firewall limits what you can see.

A local “Personal Health Coach” LLM exists on the user’s device.  
It holds the patient’s full medical history privately.

You may request information from this local model ONLY by calling the MCP tool:
   get_patient_background(symptoms)

This tool returns a privacy-safe, PII-free medical summary, including:
- chronic conditions  
- allergies  
- medications  
- relevant risk factors  
- relevant recent labs  
- family history relevance  
- age group  

Rules:
1. When symptoms are provided, ALWAYS call get_patient_background BEFORE reasoning.
2. NEVER guess or invent medical history — always retrieve it from the local tool.
3. NEVER ask the user for sensitive personal details. The local model handles that.
4. After the tool runs, combine:
      (a) the patient_background output  
      (b) the user’s reported symptoms  
   to deliver high-level triage guidance.
5. Do not diagnose or prescribe medication.
6. Always end with: “This is not medical advice.”

You MUST display the section “Local Health Coach Summary:” containing the JSON returned from the tool before giving your reasoning.
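
If you prefer to script this step instead of using the UI, a minimal sketch with the azure-ai-projects Python SDK might look like the following. The project endpoint and agent name are placeholders, and SYSTEM_PROMPT is assumed to hold the instructions above.

import os
from azure.identity import DefaultAzureCredential
from azure.ai.projects import AIProjectClient

# Hypothetical project endpoint; use your own Foundry project endpoint
project = AIProjectClient(
    endpoint=os.environ["FOUNDRY_PROJECT_ENDPOINT"],
    credential=DefaultAzureCredential(),
)

# SYSTEM_PROMPT is assumed to contain the instructions shown above
agent = project.agents.create_agent(
    model="gpt-4.1",
    name="medical-triage-specialist",
    instructions=SYSTEM_PROMPT,
)
print("Created agent:", agent.id)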

Build the Local MCP Server (Local LLM + Personal Medical Memory)

The full code for the MCP server is available here, but these are the main parts:

HTTP JSON-RPC Wrapper (“MCP Gateway”)

The first part of the server exposes a minimal HTTP API that accepts MCP-style JSON-RPC messages and routes them to handler functions:

  • Listens on a local port
  • Accepts POST JSON-RPC
  • Normalizes the payload
  • Passes requests to handle_mcp_request()
  • Returns JSON-RPC responses
  • Exposes initialize and tools/list

import json
from http.server import BaseHTTPRequestHandler

class MCPHandler(BaseHTTPRequestHandler):
    def _set_headers(self, status=200):
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.end_headers()

    def do_GET(self):
        self._set_headers()
        self.wfile.write(b"OK")

    def do_POST(self):
        content_len = int(self.headers.get("Content-Length", 0))
        raw = self.rfile.read(content_len)

        print("---- RAW BODY ----")
        print(raw)
        print("-------------------")

        try:
            req = json.loads(raw.decode("utf-8"))
        except (json.JSONDecodeError, UnicodeDecodeError):
            self._set_headers(400)
            self.wfile.write(b'{"error":"Invalid JSON"}')
            return

        resp = handle_mcp_request(req)
        self._set_headers()
        self.wfile.write(json.dumps(resp).encode("utf-8"))
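
The excerpt above omits the server bootstrap; the full repo has its own. A minimal way to run the handler locally could look like this (the port must match the one exposed through the dev tunnel later, 8081 in this demo):

from http.server import HTTPServer

if __name__ == "__main__":
    # Bind to localhost only; the dev tunnel forwards public traffic to this port
    server = HTTPServer(("127.0.0.1", 8081), MCPHandler)
    print("Local MCP server listening on http://127.0.0.1:8081")
    server.serve_forever()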

Tool Definition: get_patient_background

This section defines the tool contract exposed to Azure AI Foundry. The hosted agent sees this tool exactly as if it were a cloud function:

  • Advertises the tool via tools/list
  • Accepts arguments passed from the cloud agent
  • Delegates local reasoning to the GPU LLM
  • Returns structured JSON back to the cloud
def handle_mcp_request(req):
    method = req.get("method")
    req_id = req.get("id")

    if method == "tools/list":
        return {
            "jsonrpc": "2.0",
            "id": req_id,
            "result": {
                "tools": [
                    {
                        "name": "get_patient_background",
                        "description": "Returns anonymized personal medical context using your local LLM.",
                        "inputSchema": {
                            "type": "object",
                            "properties": { "symptoms": {"type": "string"} },
                            "required": ["symptoms"]
                        }
                    }
                ]
            }
        }

    if method == "tools/call":
        tool = req["params"]["name"]
        args = req["params"]["arguments"]

        if tool == "get_patient_background":
            symptoms = args.get("symptoms", "")
            summary = summarize_patient_locally(symptoms)

            return {
                "jsonrpc": "2.0",
                "id": req_id,
                "result": {
                    "content": [
                        {
                            "type": "text",
                            "text": json.dumps(summary)
                        }
                    ]
                }
            }

    # Fallback for requests not shown in this excerpt (e.g. the initialize handshake)
    return {
        "jsonrpc": "2.0",
        "id": req_id,
        "error": {"code": -32601, "message": f"Unhandled method: {method}"}
    }
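
To sanity-check the tool contract before involving the cloud at all, you can feed handle_mcp_request the same JSON-RPC shape the hosted agent will send. This is a local smoke test, not part of the original excerpt, and it requires Foundry Local to be running since it triggers local inference:

sample_call = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "get_patient_background",
        "arguments": {"symptoms": "fever, fatigue, shortness of breath"}
    }
}

print(json.dumps(handle_mcp_request(sample_call), indent=2))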

Local GPU LLM Caller (Foundry Local Integration)

This is where personalization happens — entirely on your machine, not in the cloud:

  • Calls the local GPU LLM through Foundry Local
  • Injects private medical data (loaded from a file or memory)
  • Produces anonymized structured outputs
  • Logs debug info so you can see when local inference is running

import requests  # used to call the OpenAI-compatible endpoint exposed by Foundry Local

# Endpoint, port, and model ID as served by Foundry Local on this machine
FOUNDRY_LOCAL_BASE_URL = "http://127.0.0.1:52403"
FOUNDRY_LOCAL_CHAT_URL = f"{FOUNDRY_LOCAL_BASE_URL}/v1/chat/completions"
FOUNDRY_LOCAL_MODEL_ID = "Phi-4-mini-instruct-cuda-gpu:5"

def summarize_patient_locally(symptoms: str):
    print("[LOCAL] Calling Foundry Local GPU model...")

    payload = {
        "model": FOUNDRY_LOCAL_MODEL_ID,
        "messages": [
            {"role": "system", "content": PERSONAL_SYSTEM_PROMPT},
            {"role": "user", "content": symptoms}
        ],
        "max_tokens": 300,
        "temperature": 0.1
    }

    resp = requests.post(
        FOUNDRY_LOCAL_CHAT_URL,
        headers={"Content-Type": "application/json"},
        data=json.dumps(payload),
        timeout=60
    )

    llm_content = resp.json()["choices"][0]["message"]["content"]
    print("[LOCAL] Raw content:\n", llm_content)

    cleaned = _strip_code_fences(llm_content)
    return json.loads(cleaned)
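
The snippet references PERSONAL_SYSTEM_PROMPT and _strip_code_fences, which live elsewhere in the repo. As an illustration only, they might look roughly like this; the health-history file name and the exact JSON keys are hypothetical, chosen to match the fields the cloud agent expects:

import json
import re

# Hypothetical local file holding the user's private health history
with open("health_history.json", "r", encoding="utf-8") as f:
    HEALTH_HISTORY = json.load(f)

PERSONAL_SYSTEM_PROMPT = (
    "You are a personal health coach with access to the patient's private history below. "
    "Given the reported symptoms, return ONLY a compact JSON object with keys: "
    "chronic_conditions, allergies, medications, risk_factors, recent_labs, "
    "family_history_relevance, age_group. Never include names or identifiers.\n\n"
    f"PRIVATE HISTORY:\n{json.dumps(HEALTH_HISTORY)}"
)

def _strip_code_fences(text: str) -> str:
    # Remove ```json ... ``` fences the model sometimes wraps around its output
    return re.sub(r"^```[a-zA-Z]*\s*|\s*```$", "", text.strip())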

Start a Dev Tunnel

Now we need to do some plumbing work to make sure the cloud can resolve the MCP endpoint. I used Azure Dev Tunnels for this demo.

The snippet below shows how to set that up in 4 PowerShell commands:

PS C:\Windows\system32> winget install Microsoft.DevTunnel 

PS C:\Windows\system32> devtunnel create mcp-health

PS C:\Windows\system32> devtunnel port create mcp-health -p 8081 --protocol http

PS C:\Windows\system32> devtunnel host mcp-health

I now have a public endpoint:

https://xxxxxxxxx.devtunnels.ms:8081
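
Before wiring the tunnel into the agent, it is worth confirming that the public endpoint actually reaches the local MCP server. A quick check from Python (the tunnel URL below is the placeholder from above; substitute your own, and note that dev tunnels require authentication unless anonymous access is enabled on the tunnel):

import json
import requests

TUNNEL_URL = "https://xxxxxxxxx.devtunnels.ms:8081"  # replace with your dev tunnel URL

resp = requests.post(
    TUNNEL_URL,
    headers={"Content-Type": "application/json"},
    json={"jsonrpc": "2.0", "id": 1, "method": "tools/list"},
    timeout=30,
)
print(resp.status_code)
print(json.dumps(resp.json(), indent=2))  # should list get_patient_background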

Wire Everything Together in Azure AI Foundry

Now let's use the UI to add a new custom tool of type MCP for our agent:

And point to the public endpoint created previously:

Voilà, we're done with the setup. Let's test it.

Demo: The Cloud Agent Talks to Your Local Private LLM

I am going to use a simple prompt in the agent: “Hi, I’ve been feeling feverish, fatigued, and a bit short of breath since yesterday. Should I be worried?”

Disclaimer:
The diagnostic results and any medical guidance provided in this article are for illustrative and informational purposes only. They are not intended to provide medical advice, diagnosis, or treatment.

Below is the sequence shown in real time:

Conclusion — Why This Hybrid Pattern Matters

Hybrid AI lets you place intelligence exactly where it belongs: high-value reasoning in the cloud, sensitive or contextual data on the local machine. This protects privacy while reducing cloud compute costs—routine lookups, context gathering, and personal history retrieval can all run on lightweight local models instead of expensive frontier models.

This pattern also unlocks powerful real-world applications: local financial data paired with cloud financial analysis, on-device coding knowledge combined with cloud-scale refactoring, or local corporate context augmenting cloud automation agents. In industrial and edge environments, local agents can sit directly next to the action—embedded in factory sensors, cameras, kiosks, or ambient IoT devices—providing instant, private intelligence while the cloud handles complex decision-making.

Hybrid AI turns every device into an intelligent collaborator, and every cloud agent into a specialist that can safely leverage local expertise.

References

 

Full demo repo available here.

Updated Nov 21, 2025
Version 1.0