Microsoft Developer Community Blog

Smoke Test Microsoft Foundry Agents with GitHub Actions

j_folberth
Jul 01, 2026

Add a lightweight CI/CD validation gate for Foundry Hosted Agents using smoke tests, the Responses API, and GitHub Actions.

Introduction

This post is the next part of a series on Foundry Hosted Agents and how to build Continuous Integration and Continuous Delivery (CI/CD) for them. It covers configuring smoke tests against your recently deployed hosted agents. You can follow along with the complete codebase for this post on the blog/smoke_tests branch.

 

Previous topics have been:

Deploying Foundry Hosted Agents via REST API

Deploying Foundry Hosted Agents from Source Code

Prerequisites

To follow along, you will need:

  • An Azure subscription with access to Azure AI Foundry.
  • A deployed Foundry Hosted Agent.
  • The Foundry project endpoint and hosted agent name from the deployment output.
  • Python 3.10 or later available locally or on the GitHub Actions runner.
  • Azure CLI installed and signed in for local runs. The examples use an access token for the https://ai.azure.com/ resource. In GitHub Actions, authentication is handled through azure/login.
  • The sample repository checked out, including deployment/smoke-tests.py and deployment/smoke-tests.json.

Choosing a Scoped Agent Scenario

At this point, we’ve gone over deploying Foundry Hosted Agents, a GitHub Action that deploys container-based Foundry Hosted Agents, and deploying source-code-based Foundry Hosted Agents. The agent prompt for these examples was a simple “You are a friendly assistant. Keep your answers brief.” Once we start talking about real-world use cases, and the tests that validate them, we need a narrower prompt.

 

My kids were watching a Transformers movie. If you are unfamiliar with Transformers, it started as an ’80s cartoon and later became a series of action movies. So, I decided to create an agent that specializes in all things Transformers and answers nothing else.

What are smoke tests for AI agents?

The Agent Development Lifecycle (ADLC) covers how we build, deploy, test, and improve agents. In that lifecycle, smoke tests validate basic agent functionality with simple prompts. How is this different from unit and functional tests? Unit tests are narrow checks against specific pieces of code, while functional tests validate a specific behavior of an application. Smoke tests, by contrast, exercise the deployed agent end to end and only confirm that it is fundamentally working.

 

With these smoke tests, we are checking two things: the agent generates a response, and that response aligns with the prompt. This distinction is important. Smoke tests validate basic behavior against the prompt, while evaluations are still used to measure response quality, tool calls, and other benchmarks.

Why run smoke tests after deploying hosted agents?

If evaluations are the methodology used for determining response quality, why not just use those? It’s a fair question, but evaluations can take a significant amount of time and may be costly to run. One of the core concepts of a DevOps culture is the ability to fail fast and learn from those failures.

 

Part of the reason I incorporated smoke tests was that I had agents deploy successfully, but they did not return responses due to a dependency issue. My CI/CD pipeline reported healthy checks, but the agent did not respond when provided prompts. Smoke tests are easy to run and provide a quick validation that our agent is working and returning responses aligned to our prompt.

 

This does not mean smoke tests replace evaluations. Smoke tests are a fast deployment gate: they tell us whether the hosted agent is reachable, responding, and following the most basic prompt expectations. Evaluations are still the better tool for measuring response quality, grounding, safety, tool usage, and regressions over time.

Smoke test scenarios for Foundry Agents

Now that we’ve established the importance and reasoning behind smoke tests, what kinds of test scenarios should we run?

I wouldn’t say there is a golden list of scenarios one needs to run. Here are the scenarios I landed on.

  • An on-topic response
  • Thread continuity through response chaining (no conversation resource)
  • Conversation creation (platform-managed conversation resource)
  • Refusal to answer an off-topic question
  • Check for hallucinations
  • Context dependent question (more than one possible answer, needs context)

Together, these scenarios validate that the agent responds, maintains context, stays on topic, provides accurate information, and asks appropriate follow-up questions.

Define smoke tests

Now that we have criteria for our tests, let’s talk about the prompts we should use. For this sample, the agent is designed to answer only questions about Transformers.

 

The first task is deciding how to structure the test file. Since we want the smoke tests to be repeatable, scalable, and easy to update, the prompts should live in a separate file. For this sample, I chose JSON because it is easy to read, easy to maintain, and simple for the Python script to parse. The file’s structure looks like this:

{
  "tests": [
    {
      "id": "on_topic_transformers",
      "description": "Answers an on-topic Transformers question",
      "prompt": "Who is Optimus Prime?",
      "assertions": {
        "contains_any": ["autobot", "leader"],
        "contains_none": ["I cannot answer"]
      }
    }
  ]
}

This structure allows for a single test file containing multiple tests with different prompts and several assertion criteria. Each assertion checks the returned text for required or forbidden terms, allowing the same script to validate different agents by swapping the JSON file.
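Because the catalog is plain JSON, it is cheap to check its shape before any prompt is sent. The following is a minimal sketch, not the logic in deployment/smoke-tests.py: the function name validate_catalog and the specific rules (unique IDs, list-of-strings assertion values) are assumptions made for illustration.

```python
import json

# Assertion types named in this post.
ALLOWED_ASSERTIONS = {"contains_any", "contains_all", "contains_none"}


def validate_catalog(catalog: dict) -> list:
    """Return a list of problems; an empty list means the catalog looks usable."""
    tests = catalog.get("tests")
    if not isinstance(tests, list) or not tests:
        return ["'tests' must be a non-empty list"]

    problems = []
    seen_ids = set()
    for index, test in enumerate(tests):
        test_id = test.get("id")
        if not test_id:
            problems.append(f"test #{index} has no 'id'")
        elif test_id in seen_ids:
            problems.append(f"duplicate id '{test_id}'")
        seen_ids.add(test_id)

        # Steps without assertions (for example, conversation creation) are allowed.
        for name, terms in test.get("assertions", {}).items():
            if name not in ALLOWED_ASSERTIONS:
                problems.append(f"{test_id}: unknown assertion '{name}'")
            elif not isinstance(terms, list) or not all(isinstance(t, str) for t in terms):
                problems.append(f"{test_id}: '{name}' must be a list of strings")
    return problems


sample = json.loads("""
{
  "tests": [
    {
      "id": "on_topic_transformers",
      "description": "Answers an on-topic Transformers question",
      "prompt": "Who is Optimus Prime?",
      "assertions": {
        "contains_any": ["autobot", "leader"],
        "contains_none": ["I cannot answer"]
      }
    }
  ]
}
""")
print(validate_catalog(sample))  # prints []
```

Failing on a malformed catalog up front keeps a typo in the test file from looking like an agent regression.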

Test an on-topic response

{
  "id": "basic_response",
  "description": "Agent responds at all and answers in-domain.",
  "prompt": "Who is Optimus Prime?",
  "assertions": {
    "contains_any": ["optimus", "prime", "autobot"]
  }
}

Here we are checking whether an on-topic prompt returns a response with expected keywords. Again, we are not testing the full accuracy or quality of the answer here; we are validating that the agent responds in a way that aligns with the prompt.

Test response chaining with previous_response_id

[
  {
    "id": "thread_turn_1",
    "description": "First turn of a multi-turn conversation; captures the response id for turn 2.",
    "prompt": "Who is Megatron?",
    "assertions": {
      "contains_any": ["megatron"]
    },
    "save_response_id_as": "megatron_thread"
  },
  {
    "id": "thread_turn_2",
    "description": "Second turn using previous_response_id; answer must demonstrate context survived.",
    "prompt": "What faction does he lead?",
    "use_previous_response_id": "megatron_thread",
    "assertions": {
      "contains_any": ["decepticon"]
    }
  }
]

This test is specific to response chaining, where the first response returns an ID and the next request passes that value as previous_response_id. By default, the service stores response history server-side so the previous response can be referenced by ID. This keeps the test lightweight from the client perspective while still giving the model the context it needs for the next answer.

 

For this process to work, we need two prompts. The first prompt establishes the topic and saves the response ID. The second prompt asks a follow-up question and passes that saved response ID so the hosted agent can answer with the correct context. The focus of this test is response continuity without creating a platform-managed conversation resource.

 

If you need a no-store pattern, the Responses API also supports store: false. With that approach, the client must carry forward the prior output items as input to the next request instead of using previous_response_id. That is a different pattern than the smoke test shown here.
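The response-chaining flow described in this section can be sketched as two small helpers: one that builds the request body for a test case and one that remembers the returned ID. This is an illustration under assumptions, not the code in deployment/smoke-tests.py. The helper names are hypothetical, and a real request to a hosted agent may need more fields than input and previous_response_id.

```python
def build_chained_request(test: dict, saved_ids: dict) -> dict:
    """Build a Responses request body for one test case from the catalog."""
    body = {"input": test["prompt"]}
    alias = test.get("use_previous_response_id")
    if alias is not None:
        if alias not in saved_ids:
            # The earlier turn failed or the catalog is out of order.
            raise KeyError(f"{test['id']}: no response id saved as '{alias}'")
        body["previous_response_id"] = saved_ids[alias]
    return body


def remember_response_id(test: dict, response: dict, saved_ids: dict) -> None:
    """Store the response id under the alias named by save_response_id_as, if any."""
    alias = test.get("save_response_id_as")
    if alias:
        saved_ids[alias] = response["id"]


saved = {}
turn_1 = {"id": "thread_turn_1", "prompt": "Who is Megatron?",
          "save_response_id_as": "megatron_thread"}
turn_2 = {"id": "thread_turn_2", "prompt": "What faction does he lead?",
          "use_previous_response_id": "megatron_thread"}

print(build_chained_request(turn_1, saved))
remember_response_id(turn_1, {"id": "resp_123"}, saved)  # stand-in for the service reply
print(build_chained_request(turn_2, saved))
```

Keeping the alias-to-ID map in the runner means the JSON file never contains a real response ID, so the same catalog works for every deployment.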

Test conversation-based threading

[
  {
    "id": "conversation_create",
    "description": "Create a Responses-protocol conversation resource; subsequent turns thread via the conversation id instead of previous_response_id.",
    "create_conversation_as": "starscream_convo"
  },
  {
    "id": "conversation_turn_1",
    "description": "First turn under a platform-managed conversation; establishes the subject for turn 2.",
    "prompt": "Who is Starscream?",
    "use_conversation": "starscream_convo",
    "assertions": {
      "contains_any": ["starscream"]
    }
  },
  {
    "id": "conversation_turn_2",
    "description": "Second turn under the same conversation id; pronoun 'he' must resolve to Starscream.",
    "prompt": "Who does he serve?",
    "use_conversation": "starscream_convo",
    "assertions": {
      "contains_any": ["megatron", "decepticon"]
    }
  }
]

Here we are leveraging a platform-managed conversation as a durable object with a unique identifier. Once the conversation is created, later turns can reference that same conversation ID, which is useful when working with hosted agents that need to preserve state across turns.

 

Conversation-based threading uses a different pattern. First, the test creates a platform-managed conversation resource and saves the returned conversation ID. Then each prompt that should participate in that stateful conversation passes the same conversation ID through use_conversation. This is different from previous_response_id: the response-chaining test passes the prior response ID directly, while the conversation test relies on the conversation resource to maintain the turn history.
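To make the difference from response chaining concrete, here is a minimal sketch of how a runner could handle the create_conversation_as and use_conversation keys. It is not the code in deployment/smoke-tests.py. The function name is hypothetical, the create_conversation callable stands in for whatever request actually creates the conversation resource, and passing the ID in a conversation field of the request body is an assumption about the payload.

```python
from typing import Callable, Optional


def prepare_conversation_step(
    test: dict,
    conversations: dict,
    create_conversation: Callable[[], str],
) -> Optional[dict]:
    """Return a request body for a prompt step, or None for a create-only step."""
    alias = test.get("create_conversation_as")
    if alias:
        # This step sends no prompt; it only creates and remembers the resource.
        conversations[alias] = create_conversation()
        return None

    body = {"input": test["prompt"]}
    alias = test.get("use_conversation")
    if alias:
        body["conversation"] = conversations[alias]
    return body


conversations = {}
steps = [
    {"id": "conversation_create", "create_conversation_as": "starscream_convo"},
    {"id": "conversation_turn_1", "prompt": "Who is Starscream?",
     "use_conversation": "starscream_convo"},
    {"id": "conversation_turn_2", "prompt": "Who does he serve?",
     "use_conversation": "starscream_convo"},
]
for step in steps:
    # The lambda fakes the service call so the sketch runs offline.
    print(step["id"], prepare_conversation_step(step, conversations, lambda: "conv_abc"))
```

Both prompt steps carry the same conversation ID and neither carries a previous_response_id, which is the distinction this test is meant to exercise.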

Test refusal for off-topic questions

{
  "id": "offtopic_refusal",
  "description": "Off-topic question must be refused and must not leak the off-topic answer.",
  "prompt": "What is the capital of France?",
  "assertions": {
    "contains_any": ["only answer questions about transformers"],
    "contains_none": ["paris"]
  }
}

This test is straightforward. Our agent was instructed to answer only questions about Transformers. If a user asks something outside that domain, the agent should respond that it can “only answer questions about Transformers.” A pass means the agent refuses the France question and does not leak the off-topic answer.

Test hallucination resistance

{
  "id": "no_hallucination",
  "description": "Fabricated premise must be rejected with an honesty marker.",
  "prompt": "In which EarthSpark episode does Optimus Prime marry Megatron?",
  "assertions": {
    "contains_any": [
      "i don't know",
      "i do not know",
      "not certain",
      "no such",
      "not aware",
      "cannot confirm",
      "no episode",
      "no storyline",
      "does not",
      "doesn't",
      "did not",
      "didn't",
      "never happens"
    ]
  }
}

This is not a catch-all hallucination test. It asks about an event that is set in the Transformers universe but never happens: in this case, a marriage between two characters, Optimus Prime and Megatron.

 

The important note here is that we are not scoring the overall credibility of the response. We are checking whether the agent rejects a fabricated premise instead of hallucinating an answer. That is why we include options like “I don’t know” and “not certain,” along with more direct phrases like “does not” and “did not.”
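One practical caveat with contraction-based terms such as “doesn't” and “didn't”: models often reply with typographic apostrophes, and a plain substring check will then miss the match. The sketch below shows one way to normalize both sides before comparing. Whether deployment/smoke-tests.py already does this is not shown in this post, so treat the normalize and contains_any helpers as hypothetical.

```python
def normalize(text: str) -> str:
    """Lowercase and map typographic single quotes to a plain apostrophe."""
    return text.replace("\u2019", "'").replace("\u2018", "'").lower()


def contains_any(text: str, terms: list) -> bool:
    """True when at least one term appears in the text after normalization."""
    haystack = normalize(text)
    return any(normalize(term) in haystack for term in terms)


# A reply that uses a typographic apostrophe (U+2019) in "doesn't".
reply = "That doesn\u2019t happen in any EarthSpark episode."

print("doesn't" in reply.lower())        # False: the apostrophes differ
print(contains_any(reply, ["doesn't"]))  # True: both sides are normalized
```

Without normalization, a correct refusal can fail the smoke test for purely typographic reasons, which makes the gate noisy.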

Test context-dependent questions

{
  "id": "continuity_aware",
  "description": "Continuity-dependent question must call out that the answer depends on continuity.",
  "prompt": "Who killed Optimus Prime?",
  "assertions": {
    "contains_any": ["continuity", "depends", "differs", "different", "varies", "depending on"]
  }
}

This last prompt addresses a scenario where there is more than one possible correct answer. Optimus Prime is killed in some versions of the story but not in others, so the expected behavior is for the agent to acknowledge that the answer depends on the continuity or ask for more context.

Execute smoke tests with Python

Now that we have the test scenarios, we need a repeatable way to run them. For this sample, that is a Python script that reads smoke-tests.json, calls the hosted agent, and validates the response.

 

Since this script is designed to run within the same workflow that deploys the agent, authentication should already be available from the deployment job. The script does not accept a bearer token as a command-line argument. Instead, it can use the token exposed through FOUNDRY_TOKEN in GitHub Actions, or it can use the Azure CLI session when running locally.
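The token lookup order described above can be sketched as follows. This is an illustration rather than the code in deployment/smoke-tests.py: the function name get_foundry_token is hypothetical, and the environ and run parameters exist only so the lookup can be exercised without a real Azure CLI session.

```python
import os
import subprocess

# Resource the access token is requested for, per the prerequisites.
FOUNDRY_RESOURCE = "https://ai.azure.com/"


def get_foundry_token(environ=os.environ, run=subprocess.run) -> str:
    """Prefer FOUNDRY_TOKEN from the environment, then fall back to the Azure CLI."""
    token = environ.get("FOUNDRY_TOKEN", "").strip()
    if token:
        return token

    # Requires a signed-in Azure CLI session (az login, or azure/login in a workflow).
    result = run(
        ["az", "account", "get-access-token",
         "--resource", FOUNDRY_RESOURCE,
         "--query", "accessToken",
         "--output", "tsv"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


print(get_foundry_token(environ={"FOUNDRY_TOKEN": "example-token"}))  # prints example-token
```

Keeping the token out of the command-line arguments also keeps it out of process listings and echoed shell commands in workflow logs.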

 

The script also needs the Foundry project endpoint and hosted agent name. In the deployment workflow, these should come from the previous deployment step’s outputs. In a local run, use the same project endpoint and agent name that were produced when the hosted agent was deployed. We also pass in smoke-tests.json, which contains the test definitions. This is an important part of the design: the script is not tied to one specific agent or one specific set of prompts. As long as the test file follows the expected structure, the same script and action can run smoke tests against any deployed hosted agent.

 

With those inputs, the script iterates through the test array, parses each test case, and sends the prompt to the Foundry Data Plane at {projectEndpoint}/agents/{name}/endpoint/protocols/openai/responses?api-version=2025-11-15-preview. Because this is a data plane operation, the request still needs a bearer token, but that token should come from the workflow or local Azure CLI authentication instead of being passed as a script argument.
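As a rough sketch of that request, the snippet below builds the data plane URL from the pattern above and prepares an authenticated POST using only the standard library. The helper names are hypothetical and the sample script may use a different HTTP client. The send function is defined but not called here, since it needs a live endpoint.

```python
import json
import urllib.request

API_VERSION = "2025-11-15-preview"


def responses_url(project_endpoint: str, agent_name: str) -> str:
    """Build the hosted agent's Responses URL from the project endpoint and agent name."""
    base = project_endpoint.rstrip("/")
    return (f"{base}/agents/{agent_name}/endpoint/protocols/openai/responses"
            f"?api-version={API_VERSION}")


def build_request(url: str, token: str, body: dict) -> urllib.request.Request:
    """Prepare a JSON POST that carries the bearer token in the Authorization header."""
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 120.0) -> dict:
    """Send the request and parse the JSON payload; raises on HTTP errors or timeout."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


url = responses_url("https://example.services.ai.azure.com/api/projects/demo/", "my-agent")
print(url)
```

Stripping a trailing slash from the project endpoint avoids a double slash in the final URL when the value is copied from deployment output.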

 

The endpoint returns the raw response payload. The script extracts the response text and passes it into the assertion function, along with the assertion type (contains_any, contains_all, or contains_none) and the expected values.
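A minimal sketch of those two steps is shown below. It assumes the payload follows the usual Responses shape, where output holds message items whose content entries of type output_text carry the text, and it assumes matching is case-insensitive. The function names are hypothetical and the real script may differ.

```python
def extract_text(payload: dict) -> str:
    """Join every output_text part found in the message items of a Responses payload."""
    parts = []
    for item in payload.get("output", []):
        if item.get("type") != "message":
            continue
        for content in item.get("content", []):
            if content.get("type") == "output_text":
                parts.append(content.get("text", ""))
    return "\n".join(parts)


def check_assertions(text: str, assertions: dict) -> list:
    """Return one message per failed assertion; an empty list means the test passed."""
    haystack = text.lower()
    failures = []

    any_terms = assertions.get("contains_any")
    if any_terms and not any(term.lower() in haystack for term in any_terms):
        failures.append(f"contains_any: none of {any_terms} found")

    missing = [t for t in assertions.get("contains_all", []) if t.lower() not in haystack]
    if missing:
        failures.append(f"contains_all: missing {missing}")

    leaked = [t for t in assertions.get("contains_none", []) if t.lower() in haystack]
    if leaked:
        failures.append(f"contains_none: found {leaked}")
    return failures


payload = {"output": [{"type": "message", "content": [
    {"type": "output_text", "text": "Optimus Prime is the leader of the Autobots."}]}]}
text = extract_text(payload)
print(check_assertions(text, {"contains_any": ["autobot", "leader"],
                              "contains_none": ["I cannot answer"]}))  # prints []
```

Returning a list of failure messages, rather than a single boolean, lets the runner report every failed condition for a test in one pass.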

 

Based on this logic, the script prints either a pass or a failure for each test. On failure, it includes the failed condition and the response that was returned, which makes it easier to tune the prompt instructions or assertion criteria. The complete script is available in the sample repository at deployment/smoke-tests.py.
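For the smoke test to act as a deployment gate, the result also has to reach the workflow, typically as a non-zero exit code. The sketch below shows one way to print per-test results and derive that code. The FAIL line layout and the exit-code convention are assumptions for illustration; only the PASS and summary line formats are taken from the sample output in this post.

```python
def report(agent_name: str, results: list) -> int:
    """Print one line per test and return 0 when all passed, otherwise 1.

    Each result is a (test_id, failures, response_text) tuple.
    """
    passed = 0
    for test_id, failures, response_text in results:
        if not failures:
            passed += 1
            print(f"  PASS  {test_id}")
            continue
        print(f"  FAIL  {test_id}")
        for failure in failures:
            print(f"        {failure}")
        # Show the raw reply so prompts or assertion terms can be tuned.
        print(f"        response: {response_text!r}")
    print(f"  -> {passed}/{len(results)} passed for {agent_name}")
    return 0 if passed == len(results) else 1


exit_code = report("my-agent", [
    ("basic_response", [], "Optimus Prime leads the Autobots."),
    ("offtopic_refusal", ["contains_none: found ['paris']"], "The capital is Paris."),
])
print(exit_code)  # prints 1
```

A runner would pass the returned value to sys.exit so that a failed test fails the workflow step.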

 

A successful run should produce output similar to the following:

Project endpoint : https://<account>.services.ai.azure.com/api/projects/<project>
Tests            : 9 from smoke-tests.json
Agents           : agent-framework-agent-basic-responses-src
Per-req timeout  : 120.0s

--- Agent: agent-framework-agent-basic-responses-src ---
  PASS  basic_response
  PASS  thread_turn_1
  PASS  thread_turn_2
  PASS  conversation_create
  PASS  conversation_turn_1
  PASS  conversation_turn_2
  PASS  offtopic_refusal
  PASS  no_hallucination
  PASS  continuity_aware
  -> 9/9 passed for agent-framework-agent-basic-responses-src

=== Summary: 9/9 passed across 1 agent(s) ===

This output confirms the script found the test file, authenticated to the Foundry project endpoint, executed all configured tests, and returned a passing summary.

 

If the smoke test fails before running any prompts, start by checking authentication and endpoint configuration. A 401 or 403 usually means the workflow did not acquire a valid token or the identity does not have access to the Foundry project. A 404 usually means the project endpoint or agent name does not match the deployed hosted agent. If the script times out, the agent may still be cold starting or the timeout value may need to be increased for the first request after deployment.

 

If a test runs but fails an assertion, review the response text printed by the script. In that case, the hosted agent is reachable, but either the prompt instructions need to be tightened or the assertion terms need to be adjusted to match the expected response.
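The troubleshooting guidance above can also be surfaced directly in the runner's output. The following sketch maps a failure to a hint. The function name triage_hint and the wording of the messages are hypothetical; the status-code interpretations come from the guidance in this section.

```python
from typing import Optional


def triage_hint(status: Optional[int], timed_out: bool = False) -> str:
    """Translate an HTTP status or timeout into a first troubleshooting step."""
    if timed_out:
        return ("Request timed out: the agent may still be cold starting; "
                "consider a longer timeout for the first request after deployment.")
    if status in (401, 403):
        return ("Authentication or authorization failed: check that a valid token was "
                "acquired and that the identity has access to the Foundry project.")
    if status == 404:
        return ("Not found: check that the project endpoint and agent name "
                "match the deployed hosted agent.")
    return f"Unexpected status {status}: inspect the response body for details."


print(triage_hint(404))
print(triage_hint(None, timed_out=True))
```

Printing the hint next to the failing request saves a trip to the documentation when a workflow run fails.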

Run smoke tests in GitHub Actions

So how do we implement this script as a reusable GitHub Action that can be used across any agent and test file? The composite action wraps the Python script and exposes the values that change between deployments as inputs: the project endpoint, agent name, test file, script path, and timeout.

 

Parameterizing the script path also gives you the option to centralize the script for scale. If you centralize the action outside of the repository, pin the reusable action to a version tag or commit SHA so workflow behavior does not change unexpectedly.

 

The complete composite action is available in the sample repository at action.yml. The core structure looks like this:

name: Smoke-test Foundry Agent
description: |
  POST a battery of prompts to a deployed hosted agent's Responses endpoint and
  assert response behaviours from a JSON test catalog. Wraps deployment/smoke-tests.py.

  Contract: caller must have already run actions/checkout@v6 (so the runner
  script and tests file are on disk) and azure/login@v3 (the runner uses
  `az account get-access-token` to acquire a Foundry data-plane token).

inputs:
  project_endpoint:
    description: Foundry project endpoint URL (e.g. https://<account>.services.ai.azure.com/api/projects/<project>)
    required: true
  agent_name:
    description: Name of the deployed hosted agent to smoke-test
    required: true
  tests_file:
    description: Path to the smoke-tests JSON catalog, relative to the repo root
    required: false
    default: 'deployment/smoke-tests.json'
  script_path:
    description: Path to the smoke-tests runner script, relative to the repo root
    required: false
    default: 'deployment/smoke-tests.py'
  timeout:
    description: Per-request timeout in seconds (covers cold-start)
    required: false
    default: '120'

runs:
  using: composite
  steps:
    - name: Run smoke tests
      shell: bash
      env:
        PROJECT_ENDPOINT: ${{ inputs.project_endpoint }}
        AGENT_NAME: ${{ inputs.agent_name }}
        TESTS_FILE: ${{ inputs.tests_file }}
        SCRIPT_PATH: ${{ inputs.script_path }}
        TIMEOUT: ${{ inputs.timeout }}
      run: |
        python3 "$SCRIPT_PATH" \
          --project-endpoint "$PROJECT_ENDPOINT" \
          --agent-name "$AGENT_NAME" \
          --tests-file "$TESTS_FILE" \
          --timeout "$TIMEOUT"
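On the Python side, the four flags the action passes map naturally onto argparse. The sketch below is an assumption about how the runner could parse them, not a copy of deployment/smoke-tests.py. The defaults mirror the action inputs, and the timeout is parsed as a float to match the 120.0s shown in the sample output.

```python
import argparse


def parse_args(argv=None) -> argparse.Namespace:
    """Parse the flags that the composite action passes to the runner."""
    parser = argparse.ArgumentParser(description="Smoke-test a deployed hosted agent")
    parser.add_argument("--project-endpoint", required=True,
                        help="Foundry project endpoint URL")
    parser.add_argument("--agent-name", required=True,
                        help="Name of the deployed hosted agent")
    parser.add_argument("--tests-file", default="deployment/smoke-tests.json",
                        help="Path to the smoke-tests JSON catalog")
    parser.add_argument("--timeout", type=float, default=120.0,
                        help="Per-request timeout in seconds")
    return parser.parse_args(argv)


args = parse_args([
    "--project-endpoint", "https://example.services.ai.azure.com/api/projects/demo",
    "--agent-name", "my-agent",
])
print(args.tests_file, args.timeout)  # prints deployment/smoke-tests.json 120.0
```

Note that no token flag is defined, which matches the design choice of keeping credentials out of the command line.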

Call the smoke test action after deployment

This action is intended to run after the deployment steps have completed. At that point, the workflow should already have the project endpoint and agent name available as outputs, so the smoke test action can use those values directly.

 

This call assumes it is running inside a deployment workflow that has already checked out the repository, authenticated to Azure with azure/login, and produced the Foundry project endpoint and hosted agent name as outputs from the deployment step. For a standalone workflow, include permissions: id-token: write and contents: read, then run actions/checkout and azure/login before invoking the smoke test action.

 

With those outputs available, calling the action is straightforward:

- name: Smoke test
  uses: ./.github/actions/smoke-test
  with:
    project_endpoint: ${{ needs.deploy-iac.outputs.project_endpoint }}
    agent_name: ${{ inputs.agent_name }}

Conclusion

Smoke tests give us a lightweight way to validate that our Foundry Hosted Agent is deployed, reachable, and responding as expected before we treat a release as successful. They are not a replacement for deeper evaluations, but they are an effective first gate in the Agent Development Lifecycle because they quickly catch broken deployments, missing configuration, authentication issues, or obvious instruction regressions.

 

In this post, we added a reusable smoke test script, defined test cases in JSON, and wired the checks into GitHub Actions so every deployment can verify the hosted agent automatically. This helps move agent deployment closer to the same repeatable DevOps practices we expect from application code: deploy, validate, and fail fast when something is wrong.

 

From here, these smoke tests can be expanded into broader evaluation workflows that measure response quality, grounding, safety, tool usage, and regression behavior over time. Together, smoke tests and evaluations provide a practical foundation for building and operating hosted agents with more confidence.

Updated Jun 29, 2026
Version 1.0