Most CI/CD pipelines start the same way. A developer creates a repository, adds a .github/workflows folder, writes some YAML, and ships. It works. For a while. Then the second service arrives. Then the tenth. Then the fiftieth. Suddenly you have fifty slightly different pipelines — each one a copy of the last, each one drifting in a different direction. One team pins a different Python version. Another skips tests on Fridays. A third has a deploy step nobody fully understands anymore. This is not a tooling problem. It is an architecture problem.
The First Shift — Treating CI/CD as a Platform
The first insight is straightforward but underused:
Your CI/CD logic is infrastructure. It deserves the same design discipline as your application code.
That means centralizing it. Versioning it. Exposing it as reusable, callable workflows — not copy-pasted YAML scattered across dozens of repos.
In Part 1 of this series, we build exactly that. A platform repository that defines reusable GitHub Actions workflows for testing, building, and deploying containerized services to Azure. Application repos stay thin — they simply call the platform, like invoking an API.
Build once. Deploy anywhere. Fix once. Every team benefits.
The Second Shift — Governing AI Behavior
But software is changing.
We are no longer just shipping APIs and microservices. We are shipping AI agents — systems that reason, respond, and make decisions. And these systems break the assumptions that traditional CI/CD was built on.
A unit test can tell you whether your code is correct. It cannot tell you whether your AI agent is trustworthy. Prompts behave like code but drift differently. Model outputs are probabilistic. Quality degrades silently, without a failed test to catch it.
This creates a new engineering challenge:
How do you build a delivery pipeline for something that does not have a deterministic right answer?
In Part 2, we extend the platform to answer that question. We introduce evaluation as a deployment gate — a reusable workflow that scores agent behavior before any deployment is allowed. We integrate with Microsoft Foundry for agent runtime and observability. And we show how the same platform-thinking from Part 1 applies directly to AI systems.
What This Series Is Really About
This is not a tutorial on GitHub Actions syntax.
It is about maturity — the difference between a team that writes pipelines and a team that designs delivery systems. Between an organization that ships code and one that governs behavior.
By the end of both parts, you will have:
- A reusable CI/CD platform that scales across any number of services
- An evaluation-driven delivery pipeline for AI agents
- A mental model for treating both code and AI as governed, versioned artifacts
The tools are GitHub Actions and Azure. The principle is platform thinking.
Let's build it.
The Problem — Why CI/CD Pipelines Don't Scale
Every pipeline starts simple.
You create a repository, add a workflow file, and within minutes your code is building and deploying automatically. It feels like a solved problem.
It isn't.
The Reality of Growth
The first pipeline is straightforward. The second is a copy of the first. The third is a copy of the second — with one small adjustment. By the time you have ten services, you have ten slightly different pipelines, each one drifting quietly away from the others.
This is pipeline sprawl — and it is far more costly than it appears.
Consider what happens in practice:
- One team upgrades their Python version. Others don't.
- A security fix gets applied to three pipelines. The other seven are missed.
- A new compliance requirement means updating every workflow file — manually, one repo at a time.
- A new engineer onboards using an old workflow and ships a pattern that was deprecated months ago.
None of this feels critical in the moment. But over time, your CI/CD layer becomes the most inconsistent, unmaintainable, and ungoverned part of your infrastructure — even though it controls everything that ships to production.
The Deeper Problem — No Separation of Concerns
The root cause is not a tooling limitation. It is a design problem.
Most teams treat CI/CD as something that lives inside an application repo — a secondary concern, not a first-class system. That model works at small scale. It breaks at org scale.
When CI/CD logic is distributed across every application repo:
- There is no single source of truth for how deployments work
- Platform teams cannot enforce standards without touching every repo individually
- Security and compliance teams have no centralized control plane
- Onboarding a new service means rebuilding from scratch — or copying from an outdated reference
The Cost You Don't See
The real cost of this pattern is not the duplicated YAML. It is the compounding overhead:
| Problem | Visible Cost | Hidden Cost |
|---|---|---|
| Duplicated pipelines | Time to replicate | Drift and inconsistency over time |
| No centralized logic | Minor friction | Security gaps across repos |
| Manual updates | One-time effort per change | Multiplied across every service |
| No versioning | Manageable today | Breaking changes with no rollback path |
What the Solution Looks Like
The answer is not a better YAML template.
It is a platform.
Specifically — a centralized repository that owns CI/CD logic, exposes it as reusable versioned workflows, and lets every application team consume it without duplicating a single line of pipeline code.
This is the same principle that drives every mature engineering organization:
Don't repeat infrastructure. Abstract it. Version it. Share it.
That is exactly what we are going to build.
The Architecture — What You're Building
Before writing a single line of code, it is worth understanding the system as a whole.
The architecture is intentionally simple. Two repositories. One cloud infrastructure. One clear separation of responsibilities.
The Two-Repo Model
The system splits into two repositories: a platform repo that owns CI/CD logic, and an application repo that consumes it. This separation is the core design decision. Everything else follows from it.
The platform repo is not an application. It does not ship features. It ships workflow infrastructure — reusable, versioned, callable by any application team in your organization.
The application repo is deliberately thin on CI/CD. It contains a single workflow file that calls the platform. Nothing more.
How They Connect
The connection happens through GitHub's workflow_call trigger — a mechanism that allows one workflow to invoke another across repositories.
The application repo does not care how the build works. It only cares about the contract — inputs it needs to provide, outputs it can expect back.
This is the same mental model as an API:
The caller knows the interface. The platform owns the implementation.
The Deployment Flow
Once triggered, the pipeline moves through four clearly defined stages: test → build → deploy to staging → deploy to production (behind an approval gate).
A few things to note about this flow:
- The image is built exactly once. The same artifact moves through every environment — no rebuilds, no drift.
- The Git SHA is the image tag. Every deployment is fully traceable back to a specific commit.
- GitHub Environments control approvals. Staging and production are separate environments with configurable protection rules — no custom approval logic needed.
The Azure Infrastructure
On the cloud side, the system uses two Azure services:
| Service | Role |
|---|---|
| Azure Container Registry (ACR) | Stores Docker images |
| Azure Container Apps | Runs the application in staging and production |
Both are provisioned using Bicep — Azure's infrastructure-as-code language — so the infrastructure is versioned and repeatable alongside the workflows.
Responsibility Map
Here is how responsibilities are distributed across the system:
| Layer | Owns | Does Not Own |
|---|---|---|
| Platform Repo | Test logic, build logic, deploy logic | Application code |
| Application Repo | Business logic, Dockerfile, requirements | Pipeline implementation |
| Azure | Runtime, registry, networking | Deployment decisions |
This clean separation means:
- Platform teams can update CI/CD logic without touching application code
- Application teams can ship features without understanding pipeline internals
- Infrastructure changes are isolated to the Bicep layer
Why This Scales
The real power of this architecture becomes clear at scale.
With fifty microservices:
One change to deploy.yml in the platform repo propagates to every service on the next run. No manual updates. No drift. No inconsistency.
This is what CI/CD as a platform means in practice.
Platform Repo — Structure and Reusable Workflows
The platform repo is the heart of this system. Everything it contains is designed to be reusable, versioned, and consumed by any application team in your organization.
Let's walk through it in full.
Repository Structure
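The layout implied by the workflows and infrastructure discussed in this series looks like this:

```
ci-platform/
├── .github/
│   └── workflows/
│       ├── test-python.yml
│       ├── build.yml
│       └── deploy.yml
└── infra/
    └── main.bicep
```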
Three workflows. One infrastructure file. That is the entire platform.
Each workflow has a single, well-defined responsibility:
| Workflow | Responsibility |
|---|---|
| test-python.yml | Install dependencies and run tests |
| build.yml | Build Docker image and push to ACR |
| deploy.yml | Deploy a specific image to a specific environment |
Workflow 1 — test-python.yml
This workflow handles dependency installation and test execution for any Python-based service.
```yaml
name: test-python

on:
  workflow_call:

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11.9"
      - run: pip install -r requirements.txt
      - run: pytest
```
What to note:
- The on: workflow_call trigger is what makes this reusable. It cannot be triggered directly — it must be called by another workflow.
- The Python version is pinned to 3.11.9 — not a floating version like 3.11. This ensures every service tests against the exact same runtime, eliminating environment-specific failures.
- Any application repo that calls this workflow gets consistent, centrally maintained test execution — without defining any of this logic themselves.
Workflow 2 — build.yml
This workflow builds the Docker image, tags it with the Git SHA, and pushes it to Azure Container Registry.
```yaml
name: build

on:
  workflow_call:
    outputs:
      image_tag:
        value: ${{ jobs.build.outputs.image_tag }}

jobs:
  build:
    runs-on: ubuntu-latest
    outputs:
      image_tag: ${{ steps.meta.outputs.tag }}
    permissions:
      id-token: write
      contents: read
    steps:
      - uses: actions/checkout@v4
      - id: meta
        run: echo "tag=${GITHUB_SHA}" >> $GITHUB_OUTPUT
      - uses: azure/login@v2
        with:
          client-id: ${{ secrets.AZURE_CLIENT_ID }}
          tenant-id: ${{ secrets.AZURE_TENANT_ID }}
          subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
      - run: az acr login --name ${{ secrets.ACR_NAME }}
      - run: |
          docker build -t ${{ secrets.ACR_LOGIN_SERVER }}/app:${{ github.sha }} .
          docker push ${{ secrets.ACR_LOGIN_SERVER }}/app:${{ github.sha }}
```
What to note:
- outputs — This workflow exposes image_tag as an output. The calling workflow captures this value and passes it downstream to the deploy workflow. This is how the same image tag flows from build → staging → production without being hardcoded anywhere.
- id-token: write — This permission enables OIDC-based authentication with Azure. No long-lived credentials are stored as secrets. GitHub generates a short-lived token at runtime, which Azure trusts via a federated identity configuration. This is the recommended authentication pattern for production workloads.
- ${GITHUB_SHA} — Using the commit SHA as the image tag makes every build fully traceable. Given any running container, you can identify the exact commit it was built from.
Workflow 3 — deploy.yml
This workflow deploys a given image to a given environment in Azure Container Apps.
```yaml
name: deploy

on:
  workflow_call:
    inputs:
      environment:
        required: true
        type: string
      image_tag:
        required: true
        type: string
      app_name:
        required: true
        type: string

jobs:
  deploy:
    runs-on: ubuntu-latest
    environment: ${{ inputs.environment }}
    steps:
      - uses: azure/login@v2
        with:
          client-id: ${{ secrets.AZURE_CLIENT_ID }}
          tenant-id: ${{ secrets.AZURE_TENANT_ID }}
          subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
      - run: |
          az containerapp update \
            --name ${{ inputs.app_name }} \
            --resource-group ${{ secrets.AZURE_RESOURCE_GROUP }} \
            --image ${{ secrets.ACR_LOGIN_SERVER }}/app:${{ inputs.image_tag }}
```
What to note:
- Three inputs — environment, image_tag, and app_name. This single workflow handles every environment. The caller decides where to deploy by passing inputs — the workflow itself has no hardcoded environment logic.
- environment: ${{ inputs.environment }} — This line is deceptively powerful. By mapping the job's environment to the input value, GitHub automatically applies whatever protection rules are configured for that environment — required reviewers, wait timers, deployment policies. Approval gates come for free.
- secrets: inherit — When the calling workflow passes secrets: inherit, Azure credentials flow through automatically without being re-declared. Secrets are managed once, at the org or repo level.
The Versioning Contract
One detail that makes this system production-ready is workflow versioning.
When an application repo calls a platform workflow, it references a specific version:
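For example, a consuming job pins the platform's test workflow like this:

```yaml
jobs:
  test:
    # The @v1 ref pins the call to a published platform version tag
    uses: ns-github-design/ci-platform/.github/workflows/test-python.yml@v1
```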
The v1 tag means:
- Application teams are insulated from breaking changes in the platform
- Platform teams can ship improvements without forcing immediate upgrades
- You can run v1 and v2 side by side during migrations
- Every deployment is traceable to a specific platform version
This versioning model is what separates a platform from a shared folder of YAML files.
What Application Teams See
From an application team's perspective, the entire platform surface looks like this:
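The three statements, extracted from the application's release workflow:

```yaml
uses: ns-github-design/ci-platform/.github/workflows/test-python.yml@v1
uses: ns-github-design/ci-platform/.github/workflows/build.yml@v1
uses: ns-github-design/ci-platform/.github/workflows/deploy.yml@v1
```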
Three uses statements. That is the entire CI/CD surface an application team needs to understand.
Everything else — authentication, image tagging, registry login, container update commands — is abstracted away inside the platform.
Azure Infrastructure
The platform workflows handle CI/CD logic. The Azure infrastructure handles the runtime — where your containers live, how they are stored, and how they are served to the outside world.
All infrastructure is defined in Bicep — Azure's native infrastructure-as-code language. This means your infrastructure is versioned, repeatable, and deployable from a single command.
Why Bicep
Before diving into the code, it is worth briefly explaining the choice.
Bicep compiles down to ARM templates but is significantly more readable. It integrates natively with Azure's resource model, requires no external state management, and fits naturally alongside GitHub Actions workflows.
For teams already working within the Azure ecosystem, it is the most straightforward path to infrastructure-as-code without introducing additional tooling dependencies.
Infrastructure Structure
The entire infrastructure is defined in a single file. For this architecture, you need two core resources (plus the managed environment that hosts Container Apps):
| Resource | Purpose |
|---|---|
| Azure Container Registry (ACR) | Stores and serves Docker images |
| Azure Container Apps | Runs containers in a managed serverless environment |
main.bicep
```bicep
param location string = resourceGroup().location

// Azure Container Registry
resource acr 'Microsoft.ContainerRegistry/registries@2023-01-01-preview' = {
  name: 'myregistry'
  location: location
  sku: { name: 'Basic' }
}

// Container Apps environment: the managed host every Container App requires
resource env 'Microsoft.App/managedEnvironments@2023-05-01' = {
  name: 'my-app-env'
  location: location
  properties: {}
}

// Azure Container App (Staging + Production)
resource containerApp 'Microsoft.App/containerApps@2023-05-01' = {
  name: 'my-app'
  location: location
  properties: {
    managedEnvironmentId: env.id
    configuration: {
      ingress: {
        external: true
        targetPort: 8000
      }
    }
    template: {
      containers: [
        {
          name: 'my-app'
          // Placeholder image for first provisioning; the pipeline
          // swaps in your ACR image tagged with the Git SHA on deploy
          image: 'mcr.microsoft.com/azuredocs/containerapps-helloworld:latest'
        }
      ]
    }
  }
}
```
Breaking It Down
Container Registry
```bicep
resource acr 'Microsoft.ContainerRegistry/registries@2023-01-01-preview' = {
  name: 'myregistry'
  location: location
  sku: { name: 'Basic' }
}
```
The ACR is the central image store for your entire platform. Every image built by build.yml is pushed here, tagged with its Git SHA. Both staging and production pull from this registry — ensuring the exact same artifact runs in both environments.
The Basic SKU is sufficient for most team-scale workloads. For larger organizations with higher throughput requirements, Standard or Premium SKUs offer geo-replication and increased storage limits.
Container App
```bicep
resource containerApp 'Microsoft.App/containerApps@2023-05-01' = {
  name: 'my-app'
  location: location
  properties: {
    managedEnvironmentId: env.id // the Container Apps environment resource
    configuration: {
      ingress: {
        external: true
        targetPort: 8000
      }
    }
    template: {
      containers: [
        {
          name: 'my-app'
          image: 'mcr.microsoft.com/azuredocs/containerapps-helloworld:latest'
        }
      ]
    }
  }
}
```
Azure Container Apps provides a fully managed serverless container runtime. You define what runs — it handles scaling, networking, and availability.
Two things to note here:
- external: true — Makes the application publicly accessible over HTTPS. Azure Container Apps automatically provisions a fully qualified domain name and TLS certificate.
- targetPort: 8000 — Maps to the port exposed by the FastAPI application inside the container. This must match the --port argument in your CMD instruction in the Dockerfile.
Staging vs. Production
You will deploy this infrastructure twice — once for staging, once for production — with different resource names:
```shell
# Deploy staging
az deployment group create \
  --resource-group rg-ciplatform-staging \
  --template-file infra/main.bicep

# Deploy production
az deployment group create \
  --resource-group rg-ciplatform-production \
  --template-file infra/main.bicep
```
The deploy.yml workflow then targets the correct app by name via the app_name input:
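From the application's release workflow, the staging call looks like:

```yaml
deploy-staging:
  needs: build
  uses: ns-github-design/ci-platform/.github/workflows/deploy.yml@v1
  with:
    environment: staging
    image_tag: ${{ needs.build.outputs.image_tag }}
    app_name: my-app-staging   # the staging Container App provisioned above
  secrets: inherit
```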
This keeps staging and production fully isolated at the infrastructure level while sharing the same workflow logic.
GitHub Environments and Approval Gates
On the GitHub side, you configure two Environments — staging and production — inside your repository settings.
For production, add a required reviewer protection rule.
When the pipeline reaches the deploy-prod job, GitHub will pause and wait for a designated reviewer to approve before proceeding. This approval gate costs nothing extra — it is built into GitHub's environment model and wired automatically through the environment: field in deploy.yml.
Setting Up Azure Authentication
The workflows authenticate to Azure using OpenID Connect (OIDC) — a keyless authentication method that eliminates the need for long-lived service principal secrets.
Set up the federated identity once:
```shell
# Create an app registration
az ad app create --display-name "github-actions-platform"

# Create the service principal for that app registration
az ad sp create --id <app-id>

# Add a federated credential for your repo
az ad app federated-credential create \
  --id <app-id> \
  --parameters '{
    "name": "github-actions",
    "issuer": "https://token.actions.githubusercontent.com",
    "subject": "repo:your-org/fastapi-app:ref:refs/heads/main",
    "audiences": ["api://AzureADTokenExchange"]
  }'
```
Then add these secrets to your GitHub repository:
| Secret | Value |
|---|---|
| AZURE_CLIENT_ID | Application (client) ID |
| AZURE_TENANT_ID | Directory (tenant) ID |
| AZURE_SUBSCRIPTION_ID | Azure subscription ID |
| AZURE_RESOURCE_GROUP | Target resource group name |
| ACR_NAME | Container registry name |
| ACR_LOGIN_SERVER | Registry login server (e.g. myregistry.azurecr.io) |
With these in place, every workflow that calls azure/login@v2 authenticates automatically — no passwords, no rotation, no expiry management.
Application Repo — Structure, Code, and Release Workflow
With the platform repo in place, the application repo becomes remarkably simple. Its only CI/CD responsibility is to call the platform — everything else is focused purely on application logic.
This is the goal: application teams ship features, not pipelines.
Repository Structure
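A layout consistent with the files discussed below:

```
fastapi-app/
├── .github/
│   └── workflows/
│       └── release.yml
├── src/
│   └── main.py
├── requirements.txt
└── Dockerfile
```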
This is the entire CI/CD footprint of the application repo.
The Application — src/main.py
The application is a minimal FastAPI service with a single endpoint that returns the current deployed version and environment.
```python
from fastapi import FastAPI
import os

app = FastAPI()

@app.get("/version")
def version():
    return {
        "version": os.getenv("GITHUB_SHA", "dev"),
        "environment": os.getenv("APP_ENV", "local")
    }
```
This endpoint serves a practical purpose beyond demonstration. In a real system, a /version or /health endpoint like this allows you to:
- Verify which commit is running in each environment
- Confirm a deployment succeeded without inspecting container logs
- Detect environment mismatches between staging and production
requirements.txt
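A sketch of what the pinned file might look like (the versions here are illustrative; pin the ones you actually test against):

```
fastapi==0.110.0
uvicorn==0.29.0
pytest==8.1.1
httpx==0.27.0
```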
All dependencies are pinned to exact versions. This ensures the same packages install in every environment — local development, CI, staging, and production — eliminating version drift as a source of failures.
Dockerfile
```dockerfile
FROM python:3.11.9-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY src ./src
CMD ["uvicorn", "src.main:app", "--host", "0.0.0.0", "--port", "8000"]
```
What to note:
- python:3.11.9-slim — The base image uses the same Python version as the platform's test-python.yml workflow. Consistency between the test environment and the container runtime eliminates an entire class of environment-specific bugs.
- Dependency layer first — requirements.txt is copied and installed before application source code. This is a deliberate layer ordering decision — Docker caches the dependency layer independently, so subsequent builds only reinstall packages when requirements.txt changes, not on every code change.
- 0.0.0.0 — Binds the server to all network interfaces inside the container, making it reachable from outside. Combined with targetPort: 8000 in the Bicep configuration, this completes the network path from Azure Container Apps to the application.
The Release Workflow — release.yml
This is the most important file in the application repo. It is also the simplest.
```yaml
name: release

on:
  push:
    branches: [main]

permissions:
  id-token: write
  contents: read

jobs:
  test:
    uses: ns-github-design/ci-platform/.github/workflows/test-python.yml@v1

  build:
    needs: test
    uses: ns-github-design/ci-platform/.github/workflows/build.yml@v1
    secrets: inherit

  deploy-staging:
    needs: build
    uses: ns-github-design/ci-platform/.github/workflows/deploy.yml@v1
    with:
      environment: staging
      image_tag: ${{ needs.build.outputs.image_tag }}
      app_name: my-app-staging
    secrets: inherit

  deploy-prod:
    needs: [build, deploy-staging]
    uses: ns-github-design/ci-platform/.github/workflows/deploy.yml@v1
    with:
      environment: production
      image_tag: ${{ needs.build.outputs.image_tag }}
      app_name: my-app-prod
    secrets: inherit
```
Walking Through the Pipeline
Trigger
Every merge to main triggers a full release. This reflects a trunk-based delivery model — main is always releasable, and every commit to it initiates the path to production.
Test Job
The first job calls the platform's test workflow. No configuration required — the platform handles Python setup, dependency installation, and test execution. The application team owns the test files; the platform owns the execution environment.
Build Job
The build job runs only after tests pass. It calls the platform's build workflow and inherits all secrets automatically — Azure credentials, ACR login server, registry name — without re-declaring them.
The critical output here is image_tag — the Git SHA of the current commit. This value is captured and passed downstream to both deploy jobs.
Deploy to Staging
The staging deployment runs immediately after a successful build. It passes three inputs to the deploy workflow:
- environment: staging — triggers GitHub's staging environment rules
- image_tag — the exact SHA built in the previous job
- app_name: my-app-staging — the target Container App in Azure
Deploy to Production
Production deployment runs only after staging succeeds. It uses the same image_tag — the identical image that just ran successfully in staging is what gets promoted to production. No rebuild. No repackaging. The artifact is immutable.
If a required reviewer is configured on the production GitHub Environment, the pipeline pauses here until approval is granted.
The Complete Pipeline at a Glance
push to main → test → build → deploy-staging → approval gate → deploy-prod
What the Application Team Never Has to Think About
It is worth being explicit about what this model abstracts away from application engineers:
| Concern | Handled By |
|---|---|
| Azure authentication | Platform (build.yml, deploy.yml) |
| Docker build and push | Platform (build.yml) |
| Image tagging strategy | Platform (build.yml) |
| Container App update command | Platform (deploy.yml) |
| Approval gate mechanics | GitHub Environments |
| Python version consistency | Platform (test-python.yml) |
The application team's CI/CD knowledge requirement is reduced to understanding three uses statements and two with input blocks. Everything else is the platform's responsibility.
Demo — Proving It Works
Your pipeline is now live and connected across three layers:
- GitHub Actions (Reusable Workflows) – powering CI/CD logic
- FastAPI Application Repo – consuming those workflows
- Azure Container Apps – running staging and production
Step 1 – Trigger the CI/CD Pipeline
Push any commit to the main branch (for a demo run, git commit --allow-empty -m "trigger release" followed by git push origin main is enough).
Then open the Actions tab of the application repository. You'll see the release workflow start automatically.
Step 2 – Observe the Pipeline Run
The jobs execute in sequence:
| Stage | Description |
|---|---|
| test | Runs pytest inside GitHub Actions using the reusable workflow test-python.yml |
| build | Builds and tags a Docker image with the current Git SHA, then pushes to ACR |
| deploy‑staging | Deploys that same image to your Container App my-app-staging |
| approval gate | Waits for approval of the production environment |
| deploy‑prod | On approval, promotes the identical image to my-app-prod |
The final dependency chain is test → build → deploy-staging → deploy-prod. The needs: [build, deploy-staging] declaration on deploy-prod enforces this ordering.
Step 3 – Review the Logs
Every job’s output is visible inside GitHub Actions:
- test – confirms tests collected successfully
- build – shows docker push ... to ACR
- deploy‑staging – displays Azure CLI output updating the Container App
- deploy‑prod – mirrors those steps after manual approval
This transparency is part of what makes reusable workflows auditable and helps them support enterprise compliance.
Step 4 – Verify Running Apps
After both deployments succeed, confirm each environment is live.
Call the /version endpoint on each Container App's public HTTPS URL (Azure assigns one to staging and one to production).

Expected response:

{"version": "abc1234", "environment": "staging"}

(The exact commit SHA replaces "abc1234"; assuming APP_ENV is set per environment, the production app reports its own environment value.)
This proves:
- The same container image was promoted unchanged.
- Both environments are consistent.
- The platform’s reusable workflows handled the full delivery flow.
The Bridge: Why AI Changes Everything
Your CI/CD platform now runs like a product: build once, test once, deploy anywhere.
But software itself is shifting.
The next generation of systems doesn’t just serve requests — it reasons.
We are no longer only shipping code.
We are shipping AI agents that evolve, learn, and behave based on prompts, data, and context.
And that introduces a new set of engineering realities.
The Old Contract
Traditional CI/CD pipelines assume:
- Code is deterministic
- Tests define correctness
- Deployments promote immutable artifacts
Those assumptions hold for APIs and microservices.
The New Reality with AI Systems
AI systems violate the core idea of “deterministic correctness.”
| Characteristic | Traditional Software | AI / Agent Systems |
|---|---|---|
| Behavior | Deterministic | Probabilistic |
| Definition of success | Binary pass/fail | Continuous score |
| Changes | Source code edits | Prompt/model/data changes |
| Validation method | Unit tests | Semantic evaluation |
| Risks | Bugs | Hallucination / drift / bias |
Prompts, fine‑tuned models, retraining data, and external tool integrations become active code paths — yet they can’t be meaningfully validated with unit tests alone.
Why This Breaks Standard CI/CD
Your current CI/CD system answers only one question:
“Did the code pass its tests?”
But for an AI agent, that’s not enough.
You also need to know:
“Did the model behave acceptably across metrics that matter?”
Without that gate, an AI update that produces worse responses could still deploy perfectly — because the pipeline has no concept of semantic quality.
The Missing Layer — Evaluation
What testing is to code, evaluation is to AI.
It separates experimental prompts from production‑ready agents.
This leads to the next maturity step:
Extend your CI/CD platform into an AI Delivery Platform — one that can evaluate, score, and gate agent behavior before deployment.
What Changes Technically
You don’t replace the CI/CD you built.
You add a new reusable workflow to the same platform:
This new workflow introduces a stage that:
- Runs offline or dataset‑based evaluation scripts
- Computes a confidence / quality score
- Blocks deployment if performance falls below threshold
What This Means Philosophically
- Build pipelines become governance systems
- Platform teams now own evaluation as much as deployment
- Reusable workflows become policies for AI reliability
The same architecture — reusable calls, versioned workflows, staged promotions — continues serving you, but with a new function: safeguarding machine behavior.
Evaluation as a Gate
Your reusable CI/CD system already enforces two things:
- Code quality → through tests
- Deployment consistency → through shared workflows
The next maturity layer is enforcing behavioral quality — ensuring an AI agent performs to a defined standard before it goes live.
That’s where evaluation pipelines come in.
The Big Shift
In conventional systems, the gate is binary: tests pass, the build ships.
For AI systems, the gate is a measurement.
Instead of pass/fail assertions, you now gate deployments on scores: accuracy, relevance, factuality, safety, or any quantitative prompt-response metric.
Reusable Workflow — evaluate-agent.yml
Add this new file to your platform repository:
File content:
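A minimal sketch of what this file could contain. The script path, the eval_score.json filename, and the threshold input are assumptions for illustration; adapt them to your evaluation setup.

```yaml
name: evaluate-agent

on:
  workflow_call:
    inputs:
      threshold:
        required: false
        type: number
        default: 0.8

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11.9"
      - run: pip install -r requirements.txt
      # Run the evaluation script; it writes eval_score.json
      - run: python eval.py
      # Fail the job (and block deployment) if the score is below threshold
      - run: |
          python - <<'EOF'
          import json, sys
          score = json.load(open("eval_score.json"))["score"]
          threshold = float("${{ inputs.threshold }}")
          print(f"score={score} threshold={threshold}")
          sys.exit(0 if score >= threshold else 1)
          EOF
      # Keep the score as an auditable build artifact
      - uses: actions/upload-artifact@v4
        with:
          name: eval-score
          path: eval_score.json
```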
Example Evaluation Script — eval.py
This script executes semantic evaluation logic for your agent.
As a proof‑of‑concept, this produces a random score.
In real use, this could compute accuracy against a dataset, compare responses to a gold standard, or call an LLM‑based judge service.
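A sketch of such a script. The random score and the eval_score.json filename are placeholders; a real implementation would score agent responses against a dataset or an LLM-based judge and write the same JSON shape.

```python
# eval.py: proof-of-concept evaluation script.
import json
import random

def evaluate() -> float:
    """Return a quality score in [0.0, 1.0]."""
    # Placeholder: replace with dataset accuracy, gold-standard
    # comparison, or a call to an LLM-judge service.
    return round(random.uniform(0.0, 1.0), 3)

if __name__ == "__main__":
    score = evaluate()
    # Persist the score so the workflow can read and gate on it
    with open("eval_score.json", "w") as f:
        json.dump({"score": score}, f)
    print(f"score={score}")
```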
Integrating the New Stage
In your AI app repo (for example, agent-app or fastapi-app once it evolves into an agent):
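A sketch of the wiring, assuming evaluate-agent.yml now lives in the platform repo (the app_name value is illustrative):

```yaml
evaluate:
  needs: build
  uses: ns-github-design/ci-platform/.github/workflows/evaluate-agent.yml@v1
  secrets: inherit

deploy-staging:
  # Deployment is blocked unless the evaluate job succeeded
  needs: [build, evaluate]
  uses: ns-github-design/ci-platform/.github/workflows/deploy.yml@v1
  with:
    environment: staging
    image_tag: ${{ needs.build.outputs.image_tag }}
    app_name: agent-app-staging
  secrets: inherit
```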
This creates a simple but powerful control flow:
If eval.py writes a score below 0.8, the pipeline stops immediately — deployment blocked, logs recorded, everything traceable.
Key Takeaways
| Concept | Description |
|---|---|
| Reusable | Same evaluate-agent workflow can gate hundreds of models |
| Configurable | Each use can override thresholds or metrics |
| Auditable | Evaluation scores logged as build artifacts |
| Safe | Prevents low-performing or biased agents from promotion |
Beyond Thresholds
Later, you can evolve this into:
- Adaptive thresholds per metric
- Human‑in‑the‑loop approvals for borderline scores
- Trend tracking – scores over time via GitHub Checks or dashboards
- Integration with observability platforms (Azure App Insights, Foundry evaluations, etc.)
AI Delivery Pipeline + Foundry Integration
So far, you have:
- A unified CI/CD platform powered by reusable GitHub Actions
- Evaluation pipelines that gate AI deployments
Now we expand that architecture into a complete AI Delivery Platform by integrating with Microsoft Foundry.
The Goal
Combine:
- GitHub Actions ↔ Foundry for seamless build‑evaluate‑deploy cycles
- Reusable workflows for policies + governance
- Foundry runtime for execution, scaling, and observability of agents
This transforms your CI/CD system into a behavior‑driven deployment layer for AI.
Conceptual Flow
Reusable CI/CD Workflows + Foundry Runtime
Your existing ci-platform repo now gains a fourth reusable workflow, evaluate-agent.yml, with monitor.yml as an optional addition for post-deploy telemetry.
Each of these maps to a Foundry capability:
| Workflow | Foundry Capability | Role |
|---|---|---|
| build.yml | Model packaging & versioning | Creates deployable image |
| evaluate-agent.yml | Evaluation service | Runs offline or dataset‑based checks |
| deploy.yml | Agent deployment | Publishes agent to Foundry runtime |
| (Additional) monitor.yml | Telemetry | Pulls evaluation metrics post‑deploy |
Example Foundry‑Aware Pipeline
In an AI repository (e.g., agent-app):
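One possible shape for the full sequence, assuming the evaluate-agent.yml workflow from the previous section (job names and app_name are illustrative):

```yaml
name: agent-release

on:
  push:
    branches: [main]

permissions:
  id-token: write
  contents: read

jobs:
  test:
    uses: ns-github-design/ci-platform/.github/workflows/test-python.yml@v1

  build:
    needs: test
    uses: ns-github-design/ci-platform/.github/workflows/build.yml@v1
    secrets: inherit

  evaluate:
    needs: build
    uses: ns-github-design/ci-platform/.github/workflows/evaluate-agent.yml@v1
    secrets: inherit

  deploy:
    # Only evaluated agent versions reach the Foundry runtime
    needs: [build, evaluate]
    uses: ns-github-design/ci-platform/.github/workflows/deploy.yml@v1
    with:
      environment: production
      image_tag: ${{ needs.build.outputs.image_tag }}
      app_name: agent-app
    secrets: inherit
```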
This sequence guarantees that only successfully evaluated agent versions are deployed to Foundry.
How Foundry Fits In
Microsoft Foundry provides:
- Agent runtime — scalable, managed environment for composable agents
- Evaluation tools — integrate LLM‑as‑judge, dataset scoring, or automatic benchmarks
- Observability layers — performance metrics, feedback loops, and telemetry
- Orchestration frameworks — connect multiple tools or sub‑agents into an ecosystem
GitHub Actions handles delivery logic.
Foundry handles AI execution and lifecycle.
Together, they form a modular operations stack for AI systems.
Benefits of Integration
| Benefit | Description |
|---|---|
| Governed Deployments | Only evaluated and approved agent versions reach Foundry |
| Traceability | Every deployed agent is linked to a Git commit and eval score |
| Reproducibility | Re‑running the pipeline from the same commit rebuilds the same code, prompt, and model configuration |
| Observability | Foundry telemetry pushes real‑world feedback back into the platform repo |
Architecture View
Governance in Practice
- Every deployment is evaluated before release.
- Every evaluation is logged as metadata in the Actions run.
- Foundry stores live metrics that can trigger automated re‑evaluation workflows downstream.
This unifies the DevOps and MLOps worlds under one pipeline.
Advanced Practices
Integrating evaluation and Foundry is the foundation. True enterprise reliability comes from how you operate and evolve those pipelines over time. Below are the main practices that transform this setup from “it works” to “it scales safely.”
1. Prompt Versioning
In AI systems, prompts are code.
A single word change in a prompt can shift an agent’s behavior as much as a logic rewrite does in software. Treat them accordingly:
- Store prompts and configurations in git (/prompts/prompt_v1.txt, prompt_v2.txt).
- Use clear change history — commits = versions.
- Reference prompt versions explicitly in deployment metadata.
- Re-runs of an old version should restore the exact prompt and configuration; versioned prompts make that possible.
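An illustrative shape for that deployment metadata — field names are examples, not a required schema; the values echo the tracking examples used throughout this section:

```yaml
# Illustrative release metadata recorded alongside each deployment
release:
  commit: f9a3c2a
  prompt_version: prompt_v3
  model: gpt-35-turbo-2024-06-01
  dataset: dataset_v2
```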
2. Experiment Tracking
Track every experiment like you track every deployment.
| Item | Example Format |
|---|---|
| Commit SHA | f9a3c2a |
| Prompt version | prompt_v3 |
| Model checkpoint | gpt‑35‑turbo 2024‑06‑01 |
| Dataset revision | dataset_v2 |
| Evaluation score | 0.87 |
Implementation tips:
- Write a short artifact file (experiment.json) in each pipeline run.
- Store it as a workflow artifact or upload it to an experiment tracker (MLflow, Azure ML Experiments, Foundry History).
- You can later analyze how prompt or model changes affect score trends.
This allows data‑driven improvement cycles: evaluate → compare → promote → monitor.
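A sketch of what writing that artifact could look like as pipeline steps — the field values mirror the example table above, and the step names are illustrative:

```yaml
# Illustrative steps: write experiment.json, then keep it as a run artifact
- name: Record experiment metadata
  run: |
    cat > experiment.json <<EOF
    {
      "commit": "${{ github.sha }}",
      "prompt_version": "prompt_v3",
      "model": "gpt-35-turbo-2024-06-01",
      "dataset": "dataset_v2",
      "score": 0.87
    }
    EOF
- name: Upload experiment artifact
  uses: actions/upload-artifact@v4
  with:
    name: experiment
    path: experiment.json
```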
3. Rollback Strategies
For deterministic software:
Rollback = redeploy previous container.
For AI systems, you may need to roll back along three dimensions:
| Dimension | Example Rollback |
|---|---|
| Code | Checkout previous commit |
| Prompt | Revert to earlier prompt file |
| Model | Reuse prior checkpoint or model ID |
Best practice: treat each version triple (code, prompt, model) as one immutable release unit in the pipeline.
GitHub tags + evaluation artifacts = auditable rollback point.
4. Continuous Evaluation
Evaluation shouldn’t stop at deployment.
Integrate post‑deployment monitoring jobs to detect drift:
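A sketch of such a monitoring job — a scheduled workflow that simply re-runs the same evaluation gate against the live configuration (the cron cadence and inputs are illustrative):

```yaml
# Illustrative scheduled re-evaluation for drift detection
name: Continuous Evaluation
on:
  schedule:
    - cron: "0 6 * * *"   # daily, 06:00 UTC
jobs:
  re-evaluate:
    uses: ns-github-design/ci-platform/.github/workflows/evaluate-agent.yml@v1.0
    with:
      dataset: datasets/faq_eval.jsonl
      min-score: 0.85
```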
Benefits:
- Detects silent performance drops caused by new data or model API changes.
- Keeps models aligned with their initial standards.
- Creates long‑term confidence for compliance audits.
5. Fail Fast, Fail Safe
Configure pipelines such that failure to evaluate = failure to deploy.
When in doubt, err on the side of protection.
Failures should be logged, retriable, and transparent — never silent.
This approach builds institutional trust in AI releases the same way software regression testing built trust in traditional CI/CD.
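The gate logic itself can be tiny. A sketch of the final check an evaluation script like eval.py might end with — the metric names and the 0.85 default are illustrative, not the article's exact implementation:

```python
def gate(scores: dict[str, float], min_score: float = 0.85) -> bool:
    """Deployment gate: pass only if every metric meets the threshold.

    An empty or missing result set counts as a failure —
    no evaluation means no deployment, never a silent pass.
    """
    if not scores:
        return False
    return all(score >= min_score for score in scores.values())

# Fail fast: a low score — or no scores at all — blocks promotion.
assert gate({"groundedness": 0.91, "relevance": 0.88})
assert not gate({"groundedness": 0.91, "relevance": 0.70})
assert not gate({})   # failure to evaluate = failure to deploy
```

Exiting non-zero when `gate` returns `False` is what makes the GitHub Actions job — and therefore the downstream deploy — fail.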
6. Governance by Design
Use GitHub’s native features (branch protections, required reviews, environment rules) as declarative governance.
Combine them with Foundry’s policy hooks:
- restrict which teams can promote evaluated agents;
- enforce minimum score thresholds;
- auto‑disable underperforming models.
Governance embedded in code scales better than manual review boards.
7. Platform Observability
Push run data into dashboards. Correlate:
- GitHub Actions runs
- Evaluation scores
- Production telemetry from Foundry
Visualization options: Azure Monitor, Power BI, Grafana.
Aim for a CI/CD + AI Ops Console view — one pane to observe quality, reliability, and speed.
Outcome of These Practices
Your organization achieves:
- Consistency across microservices and AI systems
- Accountability through versioned artifacts
- Safety via evaluation gates and drift monitors
- Agility because updates remain fast, but protected
Enterprise Scenarios
By this point, you’ve built an end‑to‑end platform:
- standardized CI/CD for apps and agents,
- reusable GitHub Actions workflows,
- Azure runtime for reliable deployments,
- Foundry‑integrated evaluation gates.
Now let’s see how this architecture performs in the wild.
Scenario 1 — Fifty Microservices, One Consistent Pipeline
Problem Statement
At scale, each microservice team usually maintains a slightly different workflow — fragmented test tools, drift in Python or Node versions, duplicated YAML.
What Goes Wrong
- Compliance updates require 50 PRs.
- Each team solves build problems differently.
- Security teams can’t easily prove consistency.
Platform Solution
- The ci-platform repo defines all workflows once (test‑python.yml, build.yml, deploy.yml).
- Every service simply invokes them via a uses: reference.
- Upgrading the base image or CI version happens once and propagates to all services.
Result
- An organization-wide upgrade from Python 3.10 to 3.11 lands in minutes.
- Consistent quality gates, policies, and artifact naming.
- Reduced cycle time, increased deployment confidence.
Scenario 2 — Regulated Enterprises (Compliance + Audit)
Problem Statement
Financial, healthcare, and government projects require strict controls:
- Auditable promotion paths
- Approval workflows
- Traceability of versions and changes
What Goes Wrong
- Manual change reviews are error‑prone.
- Different CI/CD definitions per team produce inconsistent logs.
- Compliance reports take weeks.
Platform Solution
- GitHub Environments provide built‑in approvals and reviewer rules.
- The same reusable workflows ensure identical build signatures.
- Foundry integration logs evaluation scores and deployment metadata automatically.
Result
- Reviewers approve through GitHub’s Environment gate — zero custom UI needed.
- Each release carries an immutable commit ID + evaluation score + approvers record.
- Audit reports generate directly from pipeline history.
Scenario 3 — AI‑Driven Customer Support Platform
Problem Statement
A company running customer support agents (GPT‑powered) wants to continuously improve responses but without risking live quality drops.
What Goes Wrong
- Prompt changes can silently worsen accuracy.
- Model updates impact intent coverage.
- Hard to correlate user feedback with deployment versions.
Platform Solution
- Add evaluate-agent.yml into the same CI/CD chain.
- Feed evaluation datasets that cover FAQs and tone guidelines.
- Require minimum score ≥ 0.85 for promotion.
- Deploy via Foundry to production clusters once the threshold is met.
- Stream Foundry telemetry → GitHub → Power BI for quality dashboards.
Result
- Continuous prompt experimentation without sacrificing quality.
- Regressed builds automatically blocked.
- Business stakeholders track AI accuracy as a live metric.
Bonus Scenario — Enterprise AI R&D Platform
Multiple research teams train models on‑prem or in Azure ML. The central engineering platform exposes build, evaluate, deploy steps as reusable workflows.
- Data scientists → run “evaluate‑agent” without touching infra.
- Platform engineers → control policies, thresholds, approvals.
- Leadership → gets consistent reporting on AI performance and cost.
This creates a single standard for AI lifecycle governance across business units.
Summary
Your platform now supports:
| Area | Traditional Dev | AI Adaptation |
|---|---|---|
| Build & Test | Reusable workflows (Services) | Evaluation gate (Agents) |
| Deploy | Container Apps / GitHub Environments | Foundry + Telemetry Feedback |
| Governance | Environment approval rules | Evaluation threshold + human review |
| Scaling | One repo per service | One platform per organization |
Across these cases, the core pattern holds:
Centralize workflow logic, decentralize application logic, unify governance.
14 — Conclusion
What began as a simple effort to clean up a few duplicated YAML files evolved into a complete delivery platform architecture — one that treats pipelines as first‑class products and extends their usefulness into the era of AI‑driven systems.
From Pipelines to Platforms
At first, you built reusable workflows in a shared repository.
That small structural change produced an outsized effect:
- Reduced maintenance and drift
- Consistent security and compliance
- One‑click upgrades across every service
You proved that pipeline logic belongs in its own product — a CI/CD platform.
From Deterministic to Intelligent Delivery
Then the domain changed. Deterministic services gave way to AI agents.
You responded by extending the same reusable platform into the AI dimension:
- Added evaluate-agent.yml for semantic scoring
- Introduced Foundry as the runtime for intelligent components
- Unified evaluation, governance, and deployment under the same contracts
The underlying philosophy remained identical: don’t duplicate delivery logic — standardize it.
The Broader Pattern
This architecture expresses a clear maturity pathway:
| Stage | What Changes | Technical Lever |
|---|---|---|
| CI/CD as Automation | Build pipelines per project | YAML and Actions |
| CI/CD as Product | Reusable workflows, shared logic | Platform Repo |
| CI/CD as Governance | Environments, approvals, tracking | GitHub Environments + Azure |
| AI Delivery Platform | Evaluation + behavioral policy | Foundry Integration |
Every step adds structure, traceability, and scale, without sacrificing developer velocity.
Cultural Impact
Moving to a platform model does more than streamline releases.
It elevates DevOps to a product discipline:
- Platform engineers design contracts, not scripts.
- Application teams consume delivery APIs, not ad‑hoc builds.
- AI teams get reliable evaluation and rollback mechanisms.
In short: velocity meets governance.
The Next Frontier
As this pattern matures, two frontiers are emerging:
- Autonomous Evaluation — Agents that assess other agents in continuous feedback loops.
- Dynamic Policy Enforcement — Pipelines that adjust deployment thresholds and configurations in real time based on observed performance.
The foundations you’ve built — centralized workflows, evaluation gates, and Foundry integration — already support that trajectory.
CI/CD maturity is not about writing workflows; it’s about designing reusable systems of workflows.
What you’ve built is more than CI/CD. It’s a platform that defines how modern software and AI move from idea to production safely.
15 — What’s Next
You’ve gone from writing pipelines to designing platforms.
The CI/CD model you created now governs the lifecycle of both microservices and AI agents — and it’s only the beginning.
Step 1 — Publish Your Platform
- Make both repositories public (read‑only) so others can learn from the pattern:
- ns-github-design/ci-platform – your reusable workflow product
- ns-github-design/fastapi-app – your minimal consumer example
- Tag the current stable version as v1.0 in both repos.
- Add concise READMEs explaining purpose, usage, and version policy.
This turns your repos into live documentation — a working reference architecture.
Step 2 — Add Automated Docs and Visuals
- Export your Draw.io architecture to SVG and embed it in each README.
- Use GitHub Pages or Docsify to render a small site explaining:
- platform repo overview;
- how workflow_call works;
- how to set up Azure auth;
- example runs and outputs.
Readers love code + architecture in one place.
Step 3 — Extend to AI Agents
Add a third demo:
agent-evaluator — a lightweight agent that runs eval.py and demonstrates the evaluation gate.
In that repo:
- Call evaluate-agent.yml from your platform.
- Push commits that sometimes fail thresholds.
- Show screenshots of blocked vs. approved runs.
You’ll have a fully working AI evaluation demo powered by your platform.
Step 4 — Instrument Foundry Feedback
Use Foundry’s APIs to stream live evaluation results or observability data back into GitHub Actions artifacts:
```yaml
- name: Collect Foundry feedback
  run: foundry metrics export --project my-ai-agent --output metrics.json
```
That feedback loop lets you build dashboards of quality trends alongside the deployment timeline.
Step 5 — Look Ahead to Part 3
This platform is a natural foundation for the next installment in the series:
“Autonomous Delivery Loops: Continuous Evaluation and Guardrails for AI Agents.”
Outline:
- Continuous evaluation with scheduled runs
- Self‑healing approval flows
- Dynamic policy adjustment based on metrics
- Cross‑team Governance as Code
That installment extends the pattern from governed delivery toward autonomous delivery.
Quick Recap
| Phase | Achievement |
|---|---|
| 1 – 4 | Built CI/CD Platform + App Repo |
| 5 | Configured Azure + OIDC |
| 6 | Verified Pipeline End‑to‑End |
| 8 – 15 | Documented Demo → AI Integration → Enterprise Practices → Vision |
The result is a platform that is:
- technically deep,
- architecturally distinctive,
- demonstrably real.
Every diagram, workflow, and code sample comes from a working, reproducible system.
Final Thought
Software delivery used to end at deployment.
AI delivery begins there.
The future of platforms is not just to ship software faster — but to ensure that every agent behaves as designed.