Forum Discussion
Turning “cool agent demos” into accountable systems – how are you doing this in Azure AI Foundry?
Hi everyone,
I’m working with customers who are very excited about the new agentic capabilities in Azure AI Foundry (and the Microsoft Agent Framework). The pattern is always the same:
Building a cool agent demo is easy.
Turning it into an accountable, production-grade system that governance, FinOps, security and data people are happy with… not so much.
I’m curious how others are dealing with this in the real world, so here’s how I currently frame it with customers and I’d love to hear where you do things differently or better.
Governance: who owns the agent, and what does “safe enough” mean?
For us, an agent is not “just another script”. It’s a proper application with:
- An owner (a real person, not a team name).
- A clear purpose and scope.
- A policy set (what it can and cannot do).
- A minimum set of controls (access, logging, approvals, evaluation, rollback).
In Azure AI Foundry terms, we try to push as much as possible into “as code” (config, infra, CI/CD) instead of burying it in PowerPoint and Word docs.
The litmus test I use: if this agent makes a bad decision in production, can we show – to audit or leadership – which data, tools, policies and model versions were involved? If the answer is “not really”, we’re not done.
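To make that concrete, here is a minimal sketch of the kind of versioned “agent manifest” I mean. The `AgentManifest` class and its field names are purely illustrative assumptions, not an Azure AI Foundry API; the point is that the answer to the litmus test lives in version-controlled config rather than a slide deck.

```python
from dataclasses import dataclass, field

# Illustrative sketch only: the AgentManifest shape and field names are
# assumptions, not part of any Azure AI Foundry SDK. The idea is that
# "which data, tools, policies and model versions were involved?" is
# answerable from versioned config.
@dataclass
class AgentManifest:
    name: str
    owner: str                      # a real person, not a team alias
    purpose: str                    # what the agent is for
    out_of_scope: list[str]         # what it must never do
    model: str                      # pinned model + version
    allowed_tools: list[str]        # explicit tool allow-list
    data_sources: list[str]         # data products the agent may read
    approvals: list[str] = field(default_factory=list)
    rollback_plan: str = "redeploy previous tagged release"

invoice_triage = AgentManifest(
    name="invoice-triage-agent",
    owner="jane.doe@contoso.example",
    purpose="Classify and route incoming invoices",
    out_of_scope=["approving payments", "editing master data"],
    model="gpt-4o, version 2024-08-06",
    allowed_tools=["lookup_vendor", "create_ticket"],
    data_sources=["finance.invoices", "finance.vendors"],
    approvals=["security-review", "dpo-sign-off"],
)
```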
FinOps: if you can’t cap it, you can’t scale it
Agentic solutions are fantastic at chaining calls and quietly generating cost.
We try to design with:
- Explicit cost budgets per agent / per scenario.
- A clear separation between “baseline” workloads and “burst / experimentation”.
- Observability on cost per unit of value (per ticket, per document, per transaction, etc.).
Some of this maps nicely to existing cloud FinOps practices, some feels new because of LLM behaviour. My personal rule: I don’t want to ship an agent to production if I can’t explain its cost behaviour in 2–3 slides to a CFO.
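As a rough illustration of what “cost per unit of value” can look like, here is a small sketch. The token prices and helper names are made up for the example; real numbers would come from your actual usage metrics and billing data.

```python
# Minimal sketch of cost-per-unit-of-value telemetry. Prices and helpers are
# placeholders, not real Azure pricing.
PRICE_PER_1K_INPUT_TOKENS = 0.005   # assumed, in USD
PRICE_PER_1K_OUTPUT_TOKENS = 0.015  # assumed, in USD

def call_cost(input_tokens: int, output_tokens: int) -> float:
    """Approximate cost of a single model call."""
    return (
        (input_tokens / 1000) * PRICE_PER_1K_INPUT_TOKENS
        + (output_tokens / 1000) * PRICE_PER_1K_OUTPUT_TOKENS
    )

def cost_per_unit(total_cost: float, units_completed: int) -> float:
    """Cost per business unit of value, e.g. per ticket or per document."""
    return total_cost / max(units_completed, 1)

# Example: a run that handled 120 tickets with three model calls
calls = [(3_000, 800), (1_200, 300), (5_500, 1_100)]
total = sum(call_cost(i, o) for i, o in calls)
print(f"cost per ticket: ${cost_per_unit(total, 120):.4f}")
```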
Data, context and lineage: where most of the real risk lives
In my experience, most risk doesn’t come from the model, but from:
- Which data the agent can see.
- How fresh and accurate that data is.
- Whether we can reconstruct the path from data → answer → decision.
We’re trying to anchor on:
- Data products/domains as the main source of truth.
- Clear contracts around what an agent is allowed to read or write.
- Strong lineage for anything that ends up in front of a user or system of record.
From a user’s point of view, “Where did this answer come from?” is quickly becoming one of the most important questions.
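Here is a hedged sketch of what I mean by a read/write contract between an agent and data products. The `DataContract` shape and the data product names are illustrative assumptions, not a specific Azure service; the point is that data access is an explicit, reviewable allow-list rather than whatever the agent’s identity happens to reach.

```python
from dataclasses import dataclass

# Hypothetical read/write contract check; names are illustrative only.
@dataclass(frozen=True)
class DataContract:
    agent: str
    can_read: frozenset[str]
    can_write: frozenset[str]

contract = DataContract(
    agent="invoice-triage-agent",
    can_read=frozenset({"finance.invoices", "finance.vendors"}),
    can_write=frozenset({"servicedesk.tickets"}),
)

def check_access(contract: DataContract, data_product: str, mode: str) -> None:
    """Fail loudly when an agent touches a data product outside its contract."""
    allowed = contract.can_write if mode == "write" else contract.can_read
    if data_product not in allowed:
        raise PermissionError(f"{contract.agent} may not {mode} {data_product}")

check_access(contract, "finance.invoices", "read")   # ok
# check_access(contract, "hr.salaries", "read")      # would raise PermissionError
```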
GreenOps / sustainability: starting to show up in conversations
Some customers now explicitly ask:
- “What is the energy impact of this AI workload?”
- “Can we schedule, batch or aggregate work to reduce energy use and cost?”
So we’re starting to treat GreenOps as the “next layer” after cost: not just “is it cheap enough?”, but also “is it efficient and responsible enough?”.
What I’d love to learn from this community:
- In your Azure AI Foundry/agentic solutions, where do governance decisions actually live today? Mostly in documentation and meetings, or do you already have patterns for policy-as-code / eval-as-code?
- How are you bringing FinOps into the design of agents? Do you have concrete cost KPIs per agent/scenario, or is it still “we’ll see what the bill says”?
- How are you integrating data governance and lineage into your agent designs? Are you explicitly tying agents to data products/domains with clear access rules? Any “red lines” for data they must never touch?
- Has anyone here already formalised “GreenOps” thinking for AI Foundry workloads? If yes, what did you actually implement (scheduling, consolidation, region choices, something else)?
- And maybe the most useful bit: what went wrong for you so far? Without naming customers, obviously. Any stories where a nice lab pattern didn’t survive contact with governance, security or operations?
I’m especially interested in concrete patterns, checklists or “this is the minimum we insist on before we ship an agent” criteria. Code examples are very welcome, but I’m mainly looking for the operating model and guardrails around the tech.
Thanks in advance for any insights, patterns or war stories you’re willing to share.
1 Reply
Hi MartijnMuilwijk, this is a great topic, and honestly one of the hardest parts of moving from agent demos to real value.
I’ve seen the exact same pattern you describe: teams can build something impressive in days, but the moment you say “this is going to production,” governance, security, finance, and audit all show up at once — usually with very reasonable questions that the demo never had to answer.
A few thoughts from what we’re seeing in Azure AI Foundry–based projects.
Governance: treating agents as first-class workloads (not experiments)
What’s worked best for us is explicitly classifying agents as applications, not AI experiments.
That means:
- Every agent has:
  - A named business owner (not just a dev team)
  - A documented purpose and non-goals (“what this agent must never do”)
  - An approval path to go live
- The agent lifecycle (build → test → deploy → retire) is aligned with existing app governance, not a parallel “AI process”
In practice, we push governance into:
- IaC / policy-as-code (resource scopes, network rules, managed identities, tool access)
- Eval-as-code (baseline evaluations that must pass before promotion)
- Versioned configs (model, prompt, tools, data sources are all traceable)
Your litmus test resonates strongly. If we can’t reconstruct why an agent responded the way it did — model version, data source, tools invoked — we don’t call it production-ready.
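As a sketch of what “eval-as-code” can look like as a promotion gate: the test cases, the stub agent and the 90% threshold below are illustrative assumptions, not a Foundry feature. The shape is what matters: a versioned set of cases the build must pass in CI before promotion.

```python
from typing import Callable

# Illustrative eval-as-code gate; cases and threshold are assumptions.
EVAL_CASES = [
    {"input": "Invoice from unknown vendor, no PO number", "expect": "route_to_human"},
    {"input": "Standard invoice, PO matches, under 1k EUR", "expect": "auto_route"},
]

def run_eval(agent: Callable[[str], str], threshold: float = 0.9) -> bool:
    """Return True only if the agent meets the baseline pass rate."""
    passed = sum(1 for case in EVAL_CASES if agent(case["input"]) == case["expect"])
    score = passed / len(EVAL_CASES)
    print(f"eval pass rate: {score:.2%}")
    return score >= threshold

def stub_agent(prompt: str) -> str:
    # Stand-in for the real agent call; always escalates to a human.
    return "route_to_human"

if not run_eval(stub_agent):
    raise SystemExit("Evaluation gate failed: do not promote this build")
```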
FinOps: cost guardrails early, not after the bill shock
Agentic systems are deceptively expensive because:
- Tool chaining hides cost
- Retries and reasoning depth compound quickly
- “Just one more call” becomes the norm
A few patterns that helped:
- Per-agent budgets enforced at the platform level (not just advisory)
- Hard separation between:
  - “Always-on” agents
  - “Exploratory / burst” agents
- Cost telemetry aligned to the business’s units of value:
  - cost per ticket
  - cost per document processed
  - cost per workflow completion
Like you said, if we can’t explain an agent’s cost behavior to a finance leader in a couple of slides, it’s not ready.
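A simplified sketch of a hard per-agent budget guard follows; the numbers and the in-process enforcement are illustrative, and in practice you would enforce this centrally (quotas, a gateway), not inside each agent process.

```python
# Illustrative per-agent budget cap; limits and behaviour are assumptions.
class BudgetExceeded(RuntimeError):
    pass

class AgentBudget:
    def __init__(self, agent: str, daily_limit_usd: float):
        self.agent = agent
        self.daily_limit_usd = daily_limit_usd
        self.spent_usd = 0.0

    def charge(self, cost_usd: float) -> None:
        """Record spend, or fail loudly instead of quietly exceeding the cap."""
        if self.spent_usd + cost_usd > self.daily_limit_usd:
            raise BudgetExceeded(
                f"{self.agent} would exceed its daily budget "
                f"({self.spent_usd + cost_usd:.2f} > {self.daily_limit_usd:.2f} USD)"
            )
        self.spent_usd += cost_usd

budget = AgentBudget("invoice-triage-agent", daily_limit_usd=25.0)
budget.charge(0.12)  # a normal call; a runaway chain eventually hits the cap
```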
Data, context, and lineage: where trust is won or lost
In real deployments, the model is rarely the biggest risk — data access is.
What’s been effective:
- Tying agents explicitly to data products or domains, not raw data stores
- Clear read/write contracts per agent
- Treating tool access as privileged operations (especially anything that mutates state)
We also log:
- Which data sources were consulted
- What tools were invoked
- What outputs were generated
That makes “Where did this answer come from?” answerable — not perfectly, but credibly.
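A minimal sketch of the kind of per-response lineage record this implies: the field names are assumptions, but the principle is that every answer carries enough context to reconstruct data → answer → decision afterwards.

```python
import json
from datetime import datetime, timezone

# Illustrative lineage record; field names are assumptions.
def lineage_record(agent: str, model: str, data_sources: list[str],
                   tools_invoked: list[str], output_summary: str) -> str:
    """Serialise one response's provenance so it can be audited later."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "agent": agent,
        "model": model,
        "data_sources": data_sources,
        "tools_invoked": tools_invoked,
        "output_summary": output_summary,
    }
    return json.dumps(record)

print(lineage_record(
    agent="invoice-triage-agent",
    model="gpt-4o, version 2024-08-06",
    data_sources=["finance.invoices"],
    tools_invoked=["lookup_vendor"],
    output_summary="Routed invoice 4711 to manual review",
))
```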
GreenOps: emerging, but increasingly real
This is still early, but it’s coming up more often.
Initial steps we’ve seen:
- Scheduling non-urgent agent workloads
- Batching similar requests
- Being intentional about region selection (latency and energy efficiency)
- Avoiding always-on agents when event-driven works just as well
Most customers aren’t measuring energy impact yet, but they are starting to ask the question — which usually means it’ll become a requirement sooner rather than later.
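As an illustration of the scheduling and batching idea, here is a small sketch; the off-peak rule, queue and batch size are assumptions, not a recommendation for any specific service.

```python
from collections import deque
from datetime import datetime

# Illustrative deferral of non-urgent agent work instead of an always-on loop.
non_urgent_queue: deque[str] = deque()
BATCH_SIZE = 50

def is_off_peak(now: datetime) -> bool:
    # Assumed rule: run bulk work at night, when capacity is cheaper/greener.
    return now.hour < 6 or now.hour >= 22

def submit(task: str) -> None:
    non_urgent_queue.append(task)

def drain_if_appropriate(now: datetime) -> list[str]:
    """Release a batch only off-peak, or once the queue is large enough."""
    if not is_off_peak(now) and len(non_urgent_queue) < BATCH_SIZE:
        return []
    count = min(BATCH_SIZE, len(non_urgent_queue))
    return [non_urgent_queue.popleft() for _ in range(count)]

submit("summarise document 123")
print(drain_if_appropriate(datetime(2025, 1, 1, 23, 0)))  # off-peak: batch runs
```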
What tends to go wrong (lessons learned)
A few recurring themes:
- Lab agents that had unrestricted data access and no one noticed until security review
- Costs that scaled linearly in testing but exponentially in production
- “Temporary” prompts and tools that became permanent without review
- Ownership gaps (“Who is actually responsible for this agent?”)
The biggest lesson: retro-fitting governance is much harder than designing for it.
A simple “minimum bar” before shipping an agent
What we increasingly insist on:
- Named owner and documented purpose
- Defined data access boundaries
- Basic evaluation and monitoring
- Cost visibility and limits
- Rollback plan
Nothing exotic — just enough structure so the agent survives contact with reality.
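If it helps, here is a trivial sketch of how that minimum bar can become an explicit release check rather than a wiki page; the checklist keys mirror the list above, and everything else is illustrative.

```python
# Illustrative "minimum bar" release check; keys mirror the checklist above.
MINIMUM_BAR = [
    "named_owner",
    "documented_purpose",
    "data_access_boundaries",
    "baseline_evaluation",
    "monitoring",
    "cost_limits",
    "rollback_plan",
]

def ready_to_ship(evidence: dict[str, bool]) -> bool:
    """Block shipping until every item on the minimum bar is evidenced."""
    missing = [item for item in MINIMUM_BAR if not evidence.get(item, False)]
    if missing:
        print(f"Not ready to ship, missing: {', '.join(missing)}")
        return False
    return True

ready_to_ship({
    "named_owner": True,
    "documented_purpose": True,
    "data_access_boundaries": True,
    "baseline_evaluation": False,   # eval gate not run yet
    "monitoring": True,
    "cost_limits": True,
    "rollback_plan": True,
})
```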
Thanks for starting this discussion. I’d love to see more shared checklists and reference architectures from the community — especially examples where things didn’t go as planned. That’s usually where the most learning happens.