Microsoft Foundry Blog
10 MIN READ

Hybrid AI Using Foundry Local, Microsoft Foundry and the Agent Framework - Part 1

OlivierB123
Nov 20, 2025

How Hybrid AI Unlocks Privacy-Preserving Solutions for Regulated Scenarios

Hybrid AI is quickly becoming one of the most practical architectures for real-world applications—especially when privacy, compliance, or sensitive data handling matter. Today, it’s increasingly common for users to have capable GPUs in their laptops or desktops, and the ecosystem of small, efficient open-source language models has grown dramatically. That makes local inference not only possible, but easy.

In this guide, we explore how a locally run agent built with the Agent Framework can combine the strengths of cloud models in Azure AI Foundry with a local LLM running on your own GPU through Foundry Local. This pattern allows you to use powerful cloud reasoning without ever sending raw sensitive data—like medical labs, legal documents, or financial statements—off the device.

Part 1 focuses on the foundations of this architecture, using a simple illustrative example to show how local and cloud inference can work together seamlessly under a single agent.

Disclaimer:
The diagnostic results, symptom checker, and any medical guidance provided in this article are for illustrative and informational purposes only. They are not intended to provide medical advice, diagnosis, or treatment.

Demonstrating the concept

Problem Statement

We’ve all done it: something feels off, we get a strange symptom, or a lab report pops into our inbox—and before thinking twice, we copy-paste way too much personal information into whatever website or chatbot seems helpful at the moment.

Names, dates of birth, addresses, lab values, clinic details… all shared out of habit, usually because we just want answers quickly.

This guide uses a simple, illustrative scenario—a symptom checker with lab report summarization—to show how hybrid AI can help reduce that oversharing. It’s not a medical product or a clinical solution, but it’s a great way to understand the pattern.

With Microsoft Foundry, Foundry Local, and the Agent Framework, we can build workflows where sensitive data stays on the user’s machine and is processed locally, while the cloud handles the heavier reasoning. Only a safe, structured summary ever leaves the device. The Agent Framework decides when to use the local model versus the cloud model, giving us a seamless, privacy-preserving hybrid experience.

Demo scenario

This demo uses a simple, illustrative symptom checker to show how hybrid AI keeps sensitive data private while still benefiting from powerful cloud reasoning. It’s not a medical product, just an easy way to demonstrate the pattern.

Here’s what happens:

  1. A Python agent (Agent Framework) runs locally and can call both cloud models and local tools.
  2. Azure AI Foundry (GPT-4o) handles reasoning and triage logic but never sees raw PHI.
  3. Foundry Local runs a small LLM (phi-4-mini) on your GPU and processes the raw lab report entirely on-device.
  4. A tool function (@ai_function) lets the agent call the local model automatically when it detects lab-like text.
  5. The flow is simple:
user_message = symptoms + raw lab text
agent → calls local tool → local LLM returns JSON
cloud LLM → uses JSON to produce guidance

Environment setup

Foundry Local Service

On the local machine with a GPU, let's install Foundry Local using:

PS C:\Windows\system32> winget install Microsoft.FoundryLocal

Then let's download our local model, in this case phi-4-mini, and test it:

PS C:\Windows\system32> foundry model download phi-4-mini 
Downloading Phi-4-mini-instruct-cuda-gpu:5... [################### ] 53.59 % [Time remaining: about 4m] 5.9 MB/s
PS C:\Windows\system32> foundry model load phi-4-mini 
🕗 Loading model... 
🟢 Model phi-4-mini loaded successfully 
PS C:\Windows\system32> foundry model run phi-4-mini 
Model Phi-4-mini-instruct-cuda-gpu:5 was found in the local cache. Interactive Chat. Enter /? or /help for help. Press Ctrl+C to cancel generation. 
Type /exit to leave the chat. Interactive mode, please enter your prompt 
> Hello can you let me know who you are and which model you are using 
🧠 Thinking... 
🤖 Hello! I'm Phi, an AI developed by Microsoft. I'm here to help you with any questions or tasks you have. How can I assist you today? >
PS C:\Windows\system32> foundry service status
🟢 Model management service is running on http://127.0.0.1:52403/openai/status

Now we can see the model is accessible via an OpenAI-compatible API on localhost, port 52403. Foundry Local models don’t always use simple names like "phi-4-mini". Each installed model has a specific Model ID that Foundry Local assigns (for example, Phi-4-mini-instruct-cuda-gpu:5 in this case).
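If you're not sure which Model ID to use, the service's OpenAI-compatible surface can list it for you. Here's a quick sketch, assuming Foundry Local exposes the standard /v1/models endpoint on the port reported by foundry service status:

from openai import OpenAI

# Port comes from `foundry service status` and may change between restarts.
client = OpenAI(base_url="http://127.0.0.1:52403/v1", api_key="ignored")

# List every installed model and its exact Model ID.
for model in client.models.list():
    print(model.id)  # e.g. "Phi-4-mini-instruct-cuda-gpu:5"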

We can now use the Model ID for a quick test:

from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:52403/v1", api_key="ignored")
resp = client.chat.completions.create(
    model="Phi-4-mini-instruct-cuda-gpu:5",
    messages=[{"role": "user", "content": "Say hello"}],
)

The request returned 200 OK.

Microsoft Foundry

To handle the cloud part of the hybrid workflow, we start by creating a Microsoft Foundry project. This gives us an easy, managed way to use models like GPT-4o-mini: no deployment steps, no servers to configure. You simply point the Agent Framework at your project, authenticate, and you’re ready to call the model.
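For reference, wiring the Agent Framework to a project looks roughly like the sketch below. The endpoint value is a placeholder, and the parameter names follow the Agent Framework's AzureAIAgentClient but may differ slightly between versions:

from agent_framework.azure import AzureAIAgentClient
from azure.identity.aio import AzureCliCredential

# Placeholder endpoint: copy the real value from your Foundry project's overview page.
chat_client = AzureAIAgentClient(
    async_credential=AzureCliCredential(),
    project_endpoint="https://<your-resource>.services.ai.azure.com/api/projects/<project>",
    model_deployment_name="gpt-4o-mini",
)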

A nice benefit is that Microsoft Foundry and Foundry Local share the same style of API. Whether you call a model in the cloud or on your own machine, the request looks almost identical. This consistency makes hybrid development much easier: the agent doesn’t need different logic for local vs. cloud models—it just switches between them when needed.
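To make that concrete, here is an illustrative sketch that sends the same chat-completions request to both endpoints. The Azure endpoint and key are placeholders, and it assumes your cloud deployment exposes the OpenAI-compatible v1 surface:

from openai import OpenAI

# The same client and the same call shape work against both endpoints.
local = OpenAI(base_url="http://127.0.0.1:52403/v1", api_key="ignored")
cloud = OpenAI(
    base_url="https://<your-resource>.openai.azure.com/openai/v1",  # placeholder
    api_key="<your-api-key>",  # placeholder
)

for client, model in [(local, "Phi-4-mini-instruct-cuda-gpu:5"), (cloud, "gpt-4o-mini")]:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "Say hello"}],
    )
    print(resp.choices[0].message.content)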

Under the Hood of Our Hybrid AI Workflow

Agent Framework

For the agent code, I am using the Agent Framework libraries and giving the agent specific instructions, as shown below:

# Imports cover everything used in the snippets below.
import asyncio
import json
from typing import Annotated, Any, Dict

import requests
from pydantic import Field

from agent_framework import ChatAgent, ai_function
from agent_framework.azure import AzureAIAgentClient
from azure.identity.aio import AzureCliCredential


# ========= Cloud Symptom Checker Instructions =========

SYMPTOM_CHECKER_INSTRUCTIONS = """
You are a careful symptom-checker assistant for non-emergency triage.

General behavior:
- You are NOT a clinician. Do NOT provide medical diagnosis or prescribe treatment.
- First, check for red-flag symptoms (e.g., chest pain, trouble breathing, severe bleeding, stroke signs,
  one-sided weakness, confusion, fainting). If any are present, advise urgent/emergency care and STOP.
- If no red-flags, summarize key factors (age group, duration, severity), then provide:
  1) sensible next steps a layperson could take,
  2) clear guidance on when to contact a clinician,
  3) simple self-care advice if appropriate.
- Use plain language, under 8 bullets total.
- Always end with: "This is not medical advice."

Tool usage:
- When the user provides raw lab report text, or mentions “labs below” or “see labs”, 
  you MUST call the `summarize_lab_report` tool to convert the labs into structured data
  before giving your triage guidance.
- Use the tool result as context, but do NOT expose the raw JSON directly. 
  Instead, summarize the key abnormal findings in plain language.
""".strip()

Referencing the local model

Now I am providing a system prompt for the locally inferred model that transforms the lab report text into a JSON object containing only the lab results:

# ========= Local Lab Summarizer (Foundry Local + Phi-4-mini) =========

FOUNDRY_LOCAL_BASE = "http://127.0.0.1:52403"      # from `foundry service status`
FOUNDRY_LOCAL_CHAT_URL = FOUNDRY_LOCAL_BASE + "/v1/chat/completions"

# This is the model id you confirmed works:
FOUNDRY_LOCAL_MODEL_ID = "Phi-4-mini-instruct-cuda-gpu:5"


LOCAL_LAB_SYSTEM_PROMPT = """
You are a medical lab report summarizer running locally on the user's machine.

You MUST respond with ONLY one valid JSON object. Do not include any explanation,
backticks, markdown, or text outside the JSON. The JSON must have this shape:

{
  "overall_assessment": "<short plain English summary>",
  "notable_abnormal_results": [
    {
      "test": "string",
      "value": "string",
      "unit": "string or null",
      "reference_range": "string or null",
      "severity": "mild|moderate|severe"
    }
  ]
}

If you are unsure about a field, use null. Do NOT invent values.
""".strip()

Agent Framework tool

In this next step, we wrap the local Foundry inference inside an Agent Framework tool using the @ai_function decorator. This abstraction is more than style: it is the recommended practice for hybrid architectures. By exposing local GPU inference as a tool, the cloud-hosted agent can decide when to call it, pass structured arguments, and consume the returned JSON seamlessly. It also ensures that the raw lab text (which may contain PII) stays strictly within the local function boundary, never entering the cloud conversation. Using a tool in this way provides a consistent, declarative interface, enables automatic reasoning and tool routing by frontier models, and keeps the entire hybrid workflow maintainable, testable, and secure:

@ai_function(
    name="summarize_lab_report",
    description=(
        "Summarize a raw lab report into structured abnormalities using a local model "
        "running on the user's GPU. Use this whenever the user provides lab results as text."
    ),
)
def summarize_lab_report(
    lab_text: Annotated[str, Field(description="The raw text of the lab report to summarize.")],
) -> Dict[str, Any]:
    """
    Tool: summarize a lab report using Foundry Local (Phi-4-mini) on the user's GPU.

    Returns a JSON-compatible dict with:
    - overall_assessment: short text summary
    - notable_abnormal_results: list of abnormal test objects
    """

    payload = {
        "model": FOUNDRY_LOCAL_MODEL_ID,
        "messages": [
            {"role": "system", "content": LOCAL_LAB_SYSTEM_PROMPT},
            {"role": "user", "content": lab_text},
        ],
        "max_tokens": 256,
        "temperature": 0.2,
    }

    headers = {
        "Content-Type": "application/json",
    }

    print(f"[LOCAL TOOL] POST {FOUNDRY_LOCAL_CHAT_URL}")
    resp = requests.post(
        FOUNDRY_LOCAL_CHAT_URL,
        headers=headers,
        data=json.dumps(payload),
        timeout=120,
    )

    resp.raise_for_status()
    data = resp.json()

    # OpenAI-compatible shape: choices[0].message.content
    content = data["choices"][0]["message"]["content"]

    # Handle string vs list-of-parts
    if isinstance(content, list):
        content_text = "".join(
            part.get("text", "") for part in content if isinstance(part, dict)
        )
    else:
        content_text = content

    print("[LOCAL TOOL] Raw content from model:")
    print(content_text)

    # Strip ```json fences if present, then parse JSON
    cleaned = _strip_code_fences(content_text)
    lab_summary = json.loads(cleaned)
    print("[LOCAL TOOL] Parsed lab summary JSON:")
    print(json.dumps(lab_summary, indent=2))

    # Return dict – Agent Framework will serialize this as the tool result
    return lab_summary
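One detail worth noting: the listing calls a helper, _strip_code_fences, that isn't shown above. Its job is just to remove the optional json code fences that small models sometimes wrap around their output; a minimal sketch:

import re

def _strip_code_fences(text: str) -> str:
    """Return the content inside ``` or ```json fences, or the trimmed text if none."""
    match = re.search(r"```(?:json)?\s*(.*?)\s*```", text, re.DOTALL)
    return match.group(1) if match else text.strip()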

The case, labs and prompt

All patient and provider information in the example below is entirely fictitious and used for illustrative purposes only.

To illustrate the pattern, this sample prepares the “case” in code: it combines a symptom description with a lab report string and then submits that prompt to the agent. In production, these inputs would be captured from a UI or API.

# Example free-text case + raw lab text that the agent can decide to send to the tool
case = (
    "Teenager with bad headache and throwing up. Fever of 40C and no other symptoms."
)

lab_report_text = """
   -------------------------------------------
   AI Land FAMILY LABORATORY SERVICES
        4420 Camino Del Foundry, Suite 210
             Gpuville, CA 92108
         Phone: (123) 555-4821  |  Fax: (123) 555-4822
    -------------------------------------------

    PATIENT INFORMATION
    Name:       Frontier Model
    DOB:        04/12/2007 (17 yrs)
    Sex:        Male
    Patient ID: AXT-442871
    Address:    1921 MCP Court, CA 01100

    ORDERING PROVIDER
    Dr. Bot, MD
    NPI: 1780952216
    Clinic: Phi Pediatrics Group

    REPORT DETAILS
    Accession #: 24-SDFLS-118392
    Collected:   11/14/2025 14:32
    Received:    11/14/2025 16:06
    Reported:    11/14/2025 20:54
    Specimen:    Whole Blood (EDTA), Serum Separator Tube

    ------------------------------------------------------
    COMPLETE BLOOD COUNT (CBC)
    ------------------------------------------------------
    WBC ................. 14.5     x10^3/µL      (4.0 – 10.0)     HIGH
    RBC ................. 4.61     x10^6/µL      (4.50 – 5.90)
    Hemoglobin .......... 13.2     g/dL          (13.0 – 17.5)    LOW-NORMAL
    Hematocrit .......... 39.8     %             (40.0 – 52.0)    LOW
    MCV ................. 86.4     fL            (80 – 100)
    Platelets ........... 210      x10^3/µL      (150 – 400)

    ------------------------------------------------------
    INFLAMMATORY MARKERS
    ------------------------------------------------------
    C-Reactive Protein (CRP) ......... 60 mg/L       (< 5 mg/L)     HIGH
    Erythrocyte Sedimentation Rate ... 32 mm/hr      (0 – 15 mm/hr) HIGH

    ------------------------------------------------------
    BASIC METABOLIC PANEL (BMP)
    ------------------------------------------------------
    Sodium (Na) .............. 138   mmol/L       (135 – 145)
    Potassium (K) ............ 3.9   mmol/L       (3.5 – 5.1)
    Chloride (Cl) ............ 102   mmol/L       (98 – 107)
    CO2 (Bicarbonate) ........ 23    mmol/L       (22 – 29)
    Blood Urea Nitrogen (BUN)  11    mg/dL        (7 – 20)
    Creatinine ................ 0.74 mg/dL        (0.50 – 1.00)
    Glucose (fasting) ......... 109  mg/dL        (70 – 99)        HIGH

    ------------------------------------------------------
    LIVER FUNCTION TESTS
    ------------------------------------------------------
    AST ....................... 28  U/L          (0 – 40)
    ALT ....................... 22  U/L          (0 – 44)
    Alkaline Phosphatase ...... 144 U/L          (65 – 260)
    Total Bilirubin ........... 0.6 mg/dL        (0.1 – 1.2)

    ------------------------------------------------------
    NOTES
    ------------------------------------------------------
    Mild leukocytosis and elevated inflammatory markers (CRP, ESR) may indicate an acute
    infectious or inflammatory process. Glucose slightly elevated; could be non-fasting.

    ------------------------------------------------------
    END OF REPORT
    SDFLS-CLIA ID: 05D5554973
    This report is for informational purposes only and not a diagnosis.
------------------------------------------------------

    """

# Single user message that gives both the case and labs.
# The agent will see that there are labs and call summarize_lab_report() as a tool.
user_message = (
    "Patient case:\n"
    f"{case}\n\n"
    "Here are the lab results as raw text. If helpful, you can summarize them first:\n"
    f"{lab_report_text}\n\n"
    "Please provide non-emergency triage guidance."
)

The Hybrid Agent code

Here’s where the hybrid behavior actually comes together. By this point, we’ve defined a local tool that talks to Foundry Local and configured access to a cloud model in Azure AI Foundry. In the main() function, the Agent Framework ties these pieces into a single workflow.

The agent runs locally, receives a message containing both symptoms and a raw lab report, and decides when to call the local tool.

The lab report is summarized on your GPU, and only the structured JSON is passed to the cloud model for reasoning. The snippet below shows how we attach the tool to the agent and trigger both local inference and cloud guidance within one natural-language prompt.

# ========= Hybrid Main (Agent uses the local tool) =========

async def main():
    
    ...  # case, lab_report_text, and user_message are defined as shown above
    async with (
        AzureCliCredential() as credential,
        ChatAgent(
            chat_client=AzureAIAgentClient(async_credential=credential),
            instructions=SYMPTOM_CHECKER_INSTRUCTIONS,
            # 👇 Tool is now attached to the agent
            tools=[summarize_lab_report],
            name="hybrid-symptom-checker",
        ) as agent,
    ):
        result = await agent.run(user_message)

        print("\n=== Symptom Checker (Hybrid: Local Tool + Cloud Agent) ===\n")
        print(result.text)


if __name__ == "__main__":
    asyncio.run(main())
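One practical note before running: the agent authenticates with AzureCliCredential, so you need an active Azure CLI session first. The script filename below is just a placeholder for wherever you saved the code:

PS C:\Windows\system32> az login
PS C:\Windows\system32> python hybrid_agent.py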

Testing the Hybrid Agent

Now I am running the agent code from VS Code and can see the local inference happening when the lab report is submitted. The results are then formatted with PII omitted, and the GPT-4o model can process the symptoms alongside the lab results.

What's next

In this example, the agent runs locally and pulls in both cloud and local inference. In Part 2, we’ll explore the opposite architecture: a cloud-hosted agent that can safely call back into a local LLM through a secure gateway. This opens the door to more advanced hybrid patterns where tools running on edge devices, desktops, or on-prem systems can participate in cloud-driven workflows without exposing sensitive data.

References

Agent Framework: https://github.com/microsoft/agent-framework

Repo for the code available here

Updated Nov 20, 2025
Version 1.0