Blog Post

Microsoft Foundry Blog
8 MIN READ

How Veris AI and Lume Security built a self-improving AI agent with Microsoft Foundry

andi-veris
Copper Contributor
Mar 30, 2026

Introduction

AI agents are slowly moving from demos into production, where real users, messy systems, and long-tail edge cases surface new, unseen failure modes. Production monitoring can detect these issues, but converting them into reliable improvements is slow and manual, bottlenecked by the low volume of repeatable failures, limited engineering time, and the risk of regression on previously working cases.

We show how a high-fidelity simulation environment built by Veris AI on Microsoft Azure can expand production failures into families of realistic scenarios, generating enough targeted data to optimize agent behavior through automated context engineering and reinforcement learning, all without regressing on any previously working case. We demonstrate this on a security agent built by Lume Security on Microsoft Foundry.

Lume creates an institutionally grounded security intelligence graph that captures how organizations actually work. This graph then powers deterministic, policy-aligned agents that reason alongside the user and take trusted action across security, compliance, and IT workflows.

Lume helps modern security teams scale expertise without scaling headcount. Its Security Intelligence Platform builds a continuously learning security intelligence graph that reflects how work actually gets done. It reasons over policies, prior tickets, security findings, tool configurations, documentation, and system activity to retain institutional memory and drive consistent response. On this foundation, Lume delivers context-aware security assistants that fetch the most relevant context at the right time, produce policy-aligned recommendations, and execute deterministic actions with full explainability and audit trails. These assistants are embedded in tools teams already use, such as ServiceNow, Jira, Slack, and Teams, so the experience is seamless and provides value from day one.

Microsoft Foundry and Veris AI together provide Lume with a secure, repeatable control plane that makes model iteration, safety, and simulation-driven evaluation practical.

  • Vendor flexibility. Swap or test different models from OpenAI, Anthropic, Meta, and many others, with no infrastructure changes.
  • Fast model rollout. Access the latest models as soon as they are released, making experimentation and updates easy.
  • Consistent safety. Built-in policy and guardrail tooling enforces the same checks across experiments, cutting bespoke guardrail work.
  • Enterprise privacy. Models run in private Azure instances and are not trained on client data, which simplifies and shortens AI security reviews.
  • Practical evaluation. Centralized models, logging, and policies let Veris run repeatable simulations and feed targeted failure cases back into Lume's improvement loop.
  • Simulation-driven evaluation. Run repeatable, high-fidelity simulations to stress-test agents and automatically surface failure modes before production.
  • Agent optimization. Turn graded failures into upgrades through automatic prompt fixes and targeted fine-tuning/RFT.

The Lume Solution: Contextual Intelligence for Security Team Members

Security teams are burdened by the time-consuming process of gathering necessary context from various siloed systems, tools, and subject-matter experts. This fragmented approach creates significant latency in decision-making, leads to frequent escalations, and drives up operational costs. Furthermore, incomplete or inaccurate context often results in inconsistent responses and actions that fail to fully mitigate risks.

Lume unifies an organization's fragmented data sources into a security intelligence graph. It also provides purpose-built assistants, embedded in tooling the team already uses, that can fetch relevant content from across the organization, produce explainable recommendations, execute actions with full audit trails, and update data sources when source context is missing or misleading. The result is faster, more consistent decisions and fewer avoidable escalations. With just a couple of integrations into ticketing and collaboration tools, Lume prioritizes where teams can gain the most efficiency and risk reduction from context intelligence and automation, and automatically builds out assistants that can help.

  • Search. Aggregate prior requests, owners, org, access, policy, live intel, and SIEM.
  • Plan. Produce an institutionally grounded action plan.
  • Review. Present a single decision-ready view with full context to approvers, modifying the plan based on their input.
  • Act. Execute pre-approved changes with full explainability and audit trail.
  • Close the loop. Notify stakeholders, update the graph and docs, log outcomes.
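
The five steps above can be sketched as a simple pipeline. This is an illustrative Python sketch, not Lume's implementation; the `Request` fields and the step functions are hypothetical stand-ins for the real graph-backed workflow.

```python
from dataclasses import dataclass, field

@dataclass
class Request:
    """A security request moving through the five steps (fields are illustrative)."""
    summary: str
    context: dict = field(default_factory=dict)
    plan: list = field(default_factory=list)
    approved: bool = False
    log: list = field(default_factory=list)

def search(req):
    # Aggregate context from prior requests, owners, policy, SIEM (mocked here).
    req.context = {"owner": "it-ops", "policy": "least-privilege"}
    req.log.append("search")

def make_plan(req):
    # Produce an institutionally grounded action plan from the gathered context.
    req.plan = [f"grant access per {req.context['policy']} policy"]
    req.log.append("plan")

def review(req):
    # Approver signs off on a single decision-ready view.
    req.approved = True
    req.log.append("review")

def act(req):
    # Execute only pre-approved changes, with an audit trail.
    if req.approved:
        req.log.append("act")

def close_loop(req):
    # Notify stakeholders, update the graph and docs, log outcomes.
    req.log.append("close")

req = Request("Grant read access to the payroll bucket")
for step in (search, make_plan, review, act, close_loop):
    step(req)
print(req.log)  # ['search', 'plan', 'review', 'act', 'close']
```

In a real deployment each step would call the intelligence graph and external tools; the point here is only the ordering and the approval gate between planning and acting.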

 

 

Validation with early customers has revealed the potential of Lume's security intelligence and assistants to change the way enterprises handle security analysis and tasks.

  1. 35–55% less time on routine requests. Measurements with early customers show that the assistant and access to institutional intelligence reduce the time security analysts spend on recurring intake and tactical tasks, freeing staff for higher-value work.
  2. Faster and more confident decision making. Qualitative feedback from security team members shows they feel more confident and can act faster when using the assistant, while feeling less burdened knowing Lume will help ensure the task is resolved and documented.
  3. Improved institutional memory. Every decision, rationale, and action is captured and surfaced in the security intelligence graph, increasing repeatability and reducing future rework. This captured information also updates and cleanses existing documentation so institutional knowledge aligns with current practice.
  4. Proactive identification and prioritization of opportunities. Lume first identifies and prioritizes the tasks and processes where intelligence and assistants can have the greatest impact, measuring the actual ROI for security teams.
  5. Meaningful deflection of routine requests. The assistant can respond, plan, and execute actions (with human review and approval), deflecting common escalations and reducing load on subject matter experts.

Architected on Azure

By implementing this system on the Azure stack, Lume and Veris benefited from the flexibility and breadth of services that enabled this large-scale implementation. The architecture allows the agent to evolve independently across reasoning, retrieval, and action layers without destabilizing production behavior.

  • Microsoft Foundry orchestrates model usage, prompt execution, safety policies, and evaluation hooks.
  • Azure AI Search provides hybrid retrieval across structured documents, policies, and unstructured artifacts.
  • Vector storage enables semantic retrieval of prior tickets, decisions, and organizational knowledge.
  • Graph databases capture relationships between systems, controls, owners, and historical decisions, allowing the agent to reason over organizational structure rather than isolated facts.
  • Azure Kubernetes Service (AKS) hosts the agent runtime and evaluation workloads, enabling horizontal scaling and isolated experiments.
  • Azure Key Vault manages secrets, credentials, and API access securely.
  • Azure Web App Services power customer-facing interfaces and internal dashboards for visibility and control.
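
To illustrate the kind of reasoning the graph layer enables, here is a minimal sketch using plain dictionaries in place of a graph database. The node names, relations, and the `follow` helper are hypothetical, not Lume's schema.

```python
# Edges of a tiny organizational graph: (subject, relation) -> object.
# A finding is linked to the system it affects, which in turn is linked
# to its owning team and the control that governs it.
edges = {
    ("finding:F-102", "affects"): "system:payroll",
    ("system:payroll", "owned_by"): "team:it-ops",
    ("system:payroll", "governed_by"): "control:SOC2-CC6.1",
}

def follow(node, relation):
    """Traverse one edge; returns None if the relation is absent."""
    return edges.get((node, relation))

# Two-hop reasoning over relationships rather than isolated facts:
# who owns the affected system, and which control applies?
owner = follow(follow("finding:F-102", "affects"), "owned_by")
control = follow(follow("finding:F-102", "affects"), "governed_by")
print(owner, control)  # team:it-ops control:SOC2-CC6.1
```

A graph database generalizes this to millions of edges with indexed traversal, but the reasoning pattern, chaining relations from a finding to owners and controls, is the same.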

Evaluating and ultimately improving agents is still one of the hardest parts of the stack. Static datasets don’t solve this, because agents are not static predictors. They are dynamic, multi-turn systems that change the world as they act: they call tools, accumulate state, recover (or fail to recover) from errors, and face nondeterministic user behavior. A golden dataset might grade a single response as “correct,” while completely missing whether the agent actually achieved the right end state across systems. In other words: static evals grade answers; agent evals must grade outcomes.
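
The distinction between grading answers and grading outcomes can be made concrete with a toy example. The two graders and the mock `system_state` below are illustrative, not Veris's actual verifiers.

```python
def grade_answer(final_text: str) -> bool:
    """Static-style eval: judge only the agent's final message."""
    return "access granted" in final_text.lower()

def grade_outcome(final_text: str, system_state: dict) -> bool:
    """Outcome eval: the agent must also have produced the right end state
    across the (mocked) systems it touched, not just said the right thing."""
    return (
        grade_answer(final_text)
        and system_state.get("ticket_status") == "resolved"
        and "alice" in system_state.get("s3_readers", [])
    )

# The agent *says* it succeeded, but never closed the ticket:
state = {"ticket_status": "open", "s3_readers": ["alice"]}
print(grade_answer("Access granted."))          # True  — answer grading passes
print(grade_outcome("Access granted.", state))  # False — outcome grading catches it
```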

At the same time, getting enough real-world data to drive improvement is perpetually difficult. The most valuable failures are rare, long-tail, and hard to reproduce on demand. Iterating directly in production is not possible because of safety, compliance, and customer-experience risk. That is why environments are essential: a place to test agents safely, generate repeatable experience, and create the signals that allow developers to improve behavior without gambling on live users.

Veris AI is built around that premise. It is a high-fidelity simulation platform that models the world around an agent with mock tools, realistic state transitions, and simulated users with distinct behaviors. From a single observed production failure, Veris can reconstruct what happened, expand it into a family of realistic scenario variants, and then stress-test the agent across that scenario set. Those runs are evaluated with a mix of LLM-based judges and code-based verifiers that score full trajectories, not just the final text.

Crucially, Veris does not stop at measurement. Its optimization module uses those grader signals to improve the agent by automatically refining prompts and supporting reinforcement-learning-style updates (e.g., RFT) in a closed loop. The result is a workflow where one production incident becomes a repeatable training and regression suite, producing targeted improvements on the failure mode while protecting performance on everything that already worked.
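
A closed optimization loop of this shape can be sketched in a few lines. The `refine` and `toy_grader` functions below are deliberately simplistic stand-ins for what, in Veris, are LLM-driven prompt refinement and trajectory grading over a scenario set.

```python
def refine(prompt: str) -> str:
    # Stand-in for LLM-driven refinement: append a targeted instruction
    # addressing the observed failure mode.
    return prompt + " If approval comments conflict, route to approval-validation."

def toy_grader(prompt: str) -> float:
    # Stand-in for trajectory scoring: reward prompts that carry the fix.
    return 0.9 if "approval-validation" in prompt else 0.5

def optimize(prompt: str, grader, rounds: int = 3):
    """Keep a candidate only if the grader scores it strictly higher."""
    best, best_score = prompt, grader(prompt)
    for _ in range(rounds):
        candidate = refine(best)
        score = grader(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score

prompt, score = optimize("You are a security triage agent.", toy_grader)
print(round(score, 1))  # 0.9 with the toy grader
```

The accept-only-on-improvement rule is what keeps the loop from drifting: a candidate that scores no better than the incumbent is discarded.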

 

The Veris AI environment is available on Azure Kubernetes Service (AKS) and can be deployed into a user's Virtual Network.

Under the Hood: Building a self-improving agent

When a failure appears in production, the improvement loop starts from the trace. Veris takes the failed session logs from the observability pipeline, and an LLM-based evaluator pinpoints what actually went wrong and automatically writes a new, targeted evaluation rubric for that specific failure mode. From there, the Veris simulator expands the single incident into a scenario family: dozens of realistic variants with branching, so the agent can be trained and re-tested against the whole class of failure, not just one example. This matters because failure modes are often sparse in the real world. The scenarios are executed in the simulation engine with mocked tools and simulated users, producing full interaction traces that are then scored by the evaluation engine.
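
Expanding one incident into a scenario family amounts to varying the seed along realistic axes. A minimal sketch, assuming hypothetical axes (approval phrasing, caveat, channel) rather than Veris's actual scenario schema:

```python
import itertools

# The single observed production failure (illustrative seed).
seed_failure = {
    "workflow": "access-request",
    "comment": "Manager approved, but please attach the data-handling form.",
}

# Hypothetical variation axes for the scenario family.
approval_phrases = ["approved", "LGTM", "ok to proceed"]
caveats = ["pending security review", "need the data-handling form", "after Q3 freeze"]
channels = ["jira_comment", "slack_thread"]

# Cartesian product of the axes yields a family of realistic variants.
scenarios = [
    {**seed_failure, "comment": f"Manager {a}, but {c}.", "channel": ch}
    for a, c, ch in itertools.product(approval_phrases, caveats, channels)
]
print(len(scenarios))  # 18 variants from one seed incident
```

Real scenario generation is LLM-driven and includes branching user behavior, but the core idea, one sparse failure becoming a dense test set, is the same.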

The scores become the training signal for optimization. The Veris optimization engine uses the evaluation results, the original system prompt, and the new rubric to generate an updated prompt designed to fix the failure without breaking previously good behavior. It then validates the new prompt on both (a) the specialized scenario set created from the incident and (b) a broader held-out regression suite, to ensure the improvement generalizes and does not cause regressions.
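
That acceptance criterion can be expressed as a simple gate: take the update only if it improves the targeted failure set and regresses nowhere on the held-out suite. A sketch with made-up per-scenario scores:

```python
def accept_update(old_scores: dict, new_scores: dict, target_ids: set) -> bool:
    """Accept a prompt/model update only if it improves the targeted
    failure scenarios and does not regress any held-out scenario.
    (Illustrative criterion; real gating may use statistical tests.)"""
    improved = sum(new_scores[i] for i in target_ids) > sum(old_scores[i] for i in target_ids)
    heldout = set(old_scores) - target_ids
    no_regression = all(new_scores[i] >= old_scores[i] for i in heldout)
    return improved and no_regression

old      = {"s1": 0.2, "s2": 0.3, "r1": 1.0, "r2": 0.9}  # s* = targeted, r* = held-out
new_good = {"s1": 0.8, "s2": 0.9, "r1": 1.0, "r2": 0.9}  # better, no regressions
new_bad  = {"s1": 0.8, "s2": 0.9, "r1": 0.7, "r2": 0.9}  # better, but regresses r1
targets = {"s1", "s2"}
print(accept_update(old, new_good, targets), accept_update(old, new_bad, targets))
# True False
```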

In this case study, we focused on a key failure mode: incorrect workflow classification at the triage step for approval-driven access requests. In these cases, tickets were routed into an escalation or in-progress workflow instead of the approval-validation path. The ticket often contained conflicting approval and rejection signals in human comments (for example, a manager approved but required some additional information), a genuine occurrence in organizational Jira workflows. The triage agent failed to recognize these mixed signals and misclassified the workflow state. Since triage determines what runs next, a single misclassification at the start was enough to bypass the Approval Agent and send the request down an incorrect downstream path.
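
The failure mode can be reproduced with a toy rule-based triage. Both functions below are illustrative (the real triage step is model-driven), but the contrast shows how a caveat like "they require additional information" can mask an explicit approval:

```python
def triage_naive(comment: str) -> str:
    # Buggy behavior: any request for more information overrides the
    # approval signal, routing the ticket off the approval path.
    text = comment.lower()
    if "require" in text or "need" in text:
        return "in-progress"
    return "approval-validation" if "approv" in text else "escalation"

def triage_fixed(comment: str) -> str:
    # Fixed behavior: an explicit approval keeps the ticket on the
    # approval-validation path, even alongside a request for more info.
    text = comment.lower()
    if "approv" in text:
        return "approval-validation"
    if "require" in text or "need" in text:
        return "in-progress"
    return "escalation"

comment = "Manager approved, but they require some additional information."
print(triage_naive(comment), triage_fixed(comment))
# in-progress approval-validation
```

In the production agent the equivalent fix was encoded in the refined system prompt rather than hard rules, but the behavioral contract is the same: approval signals must dominate caveats at the routing step.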

Results & Impact

By running the optimization loop, we improved agent accuracy on this issue by over 40%, without regressing on any other correct behavior. We ran experiments on a dataset containing only scenarios with the issue and on a more general dataset encompassing a variety of scenarios. The experiment results on both datasets and the updated prompt are shown below.

 

Repeating this loop for any issue that arises in production or simulation will continuously improve agent performance.

Takeaways

This collaboration shows a practical pattern for taking an enterprise agent from "works in a demo" to "improves safely in production": pair an orchestration layer that standardizes model usage, safety, logging, and evaluation with a simulation environment where failures can be reproduced and fixed without risking real users, then make the loop repeatable in practice. In this stack, Veris AI provides the simulations, trajectory grading, and optimization loop, while Microsoft Foundry operationalizes the workflow with vendor-flexible model iteration, consistent policy enforcement, enterprise privacy, and centralized evaluation hooks that turn testing into a first-class system instead of bespoke glue code.

The result is an improved, more reliable Lume assistant that can help enterprises spend 35–55% less time on routine requests and meaningfully deflect repetitive escalations, with fewer clarification cycles, faster response times, and a stronger institutional memory where decisions and rationales compound over time instead of getting lost across tools and tickets.

The self-improving loop continuously improves the agent, starting from a production trace, by generating a targeted rubric, expanding the incident into a scenario family, running it end-to-end with mocked tools and simulated users, scoring full trajectories, then using those scores to produce a safer prompt/model update and validating it against both the failure set and a broader regression suite. This turns rare long-tail failures into repeatable training and regression assets.

If you’re building an AI agent, the recommendation is straightforward. Invest early in an orchestration and safety layer, as well as an environment-driven evaluation that can create an improvement loop to ship fixes without regressions. This way, your production failures act as the highest-signal input to continuously harden the system.


To learn more or start building, teams can explore the Veris Console for free or browse the Veris documentation. In this stack, Microsoft Foundry provides the orchestration and safety control plane, and Veris provides the simulation, evaluation, and optimization loop needed to make agents improve safely over time. Learn more about the Lume Security assistant here.

Updated Mar 24, 2026
Version 1.0