What's New in Microsoft EDU - March 2026
Join us on Wednesday, March 25th, 2026 for our latest "What's New in Microsoft EDU" webinar! We will be covering all of the latest product updates from Microsoft Education. These 30-minute webinars are put on by the Microsoft Education Product Management group and happen once per month; this month we are running sessions at both 8:00am Pacific Time and 4:00pm Pacific Time to cover as many global time zones as possible. And don't worry – we'll be recording these and posting them on our Microsoft Education YouTube channel in the new "What's New in Microsoft EDU" playlist, so you'll always be able to watch later or share with others!

Here is our March 2026 webinar agenda:

1) M365 Copilot and AI updates for Educators and Students
   - Modify Existing Content
   - Minecraft EDU Lesson Plans
   - New Learning Activities: Fill in the Blanks, Matching and Self-Quizzing
   - Study & Learn agent for students
2) Learning Zone General Availability and the Copilot+ PC
3) Microsoft 365 LTI and Teach Module for Learning Management Systems
4) AMA - Ask Microsoft EDU Anything (Q&A)

We look forward to having you attend the event!

How to sign up

📅 OPTION 1: March 25th, Wednesday @ 8:00am Pacific Time – Register here
📅 OPTION 2: March 25th, Wednesday @ 4:00pm Pacific Time – Register here

This is what the webinar portal will look like when you register:

We look forward to seeing you there!

Mike Tholfsen
Group Product Manager
Microsoft Education

Build a Fully Offline RAG App with Foundry Local: No Cloud Required
A practical guide to building an on-device AI support agent using Retrieval-Augmented Generation, JavaScript, and Microsoft Foundry Local.

The Problem: AI That Can't Go Offline

Most AI-powered applications today are firmly tethered to the cloud. They assume stable internet, low-latency API calls, and the comfort of a managed endpoint. But what happens when your users are in an environment with zero connectivity: a gas pipeline in a remote field, a factory floor, an underground facility?

That's exactly the scenario that motivated this project: a fully offline RAG-powered support agent that runs entirely on a laptop. No cloud. No API keys. No outbound network calls. Just a local model, a local vector store, and domain-specific documents, all accessible from a browser on any device.

The Gas Field Support Agent - running entirely on-device

What is RAG and Why Should You Care?

Retrieval-Augmented Generation (RAG) is a pattern that makes language models genuinely useful for domain-specific tasks. Instead of hoping the model "knows" the answer from pre-training, you:

1. Retrieve relevant chunks from your own documents
2. Augment the model's prompt with those chunks as context
3. Generate a response grounded in your actual data

The result: fewer hallucinations, traceable answers, and an AI that works with your content. If you're building internal tools, customer support bots, field manuals, or knowledge bases, RAG is the pattern you want.

Why fully offline? Data sovereignty, air-gapped environments, field operations, latency-sensitive workflows, and regulatory constraints all demand AI that doesn't phone home. Running everything locally gives you complete control over your data and eliminates any external dependency.

The Tech Stack

This project is deliberately simple — no frameworks, no build steps, no Docker:

| Layer | Technology | Why |
| --- | --- | --- |
| AI Model | Foundry Local + Phi-3.5 Mini | Runs locally, OpenAI-compatible API, no GPU needed |
| Backend | Node.js + Express | Lightweight, fast, universally known |
| Vector Store | SQLite via better-sqlite3 | Zero infrastructure, single file on disk |
| Retrieval | TF-IDF + cosine similarity | No embedding model required, fully offline |
| Frontend | Single HTML file with inline CSS | No build step, mobile-responsive, field-ready |

The total dependency footprint is just four npm packages: express, openai, foundry-local-sdk, and better-sqlite3.

Architecture Overview

The system has five layers — all running on a single machine:

Five-layer architecture: Client → Server → RAG Pipeline → Data → AI Model

1. Client Layer — A single HTML file served by Express, with quick-action buttons and responsive chat
2. Server Layer — Express.js handles API routes for chat (streaming + non-streaming), document upload, and health checks
3. RAG Pipeline — The chat engine orchestrates retrieval and generation; the chunker handles TF-IDF vectorization
4. Data Layer — SQLite stores document chunks and their TF-IDF vectors; source docs live as .md files
5. AI Layer — Foundry Local runs Phi-3.5 Mini Instruct on CPU/NPU, exposing an OpenAI-compatible API

Getting Started in 5 Minutes

You need two prerequisites:

- Node.js 20+ — nodejs.org
- Foundry Local — Microsoft's on-device AI runtime

Terminal:
```
winget install Microsoft.FoundryLocal
```

Then clone, install, ingest, and run:

```
git clone https://github.com/leestott/local-rag.git
cd local-rag
npm install
npm run ingest   # Index the 20 gas engineering documents
npm start        # Start the server + Foundry Local
```

Open http://127.0.0.1:3000 and start chatting. Foundry Local auto-downloads Phi-3.5 Mini (~2 GB) on first run.
How the RAG Pipeline Works

Let's trace what happens when a user asks: "How do I detect a gas leak?"

RAG query flow: Browser → Server → Vector Store → Model → Streaming response

Step 1: Document Ingestion

Before any queries happen, npm run ingest reads every .md file from the docs/ folder, splits each into overlapping chunks (~200 tokens, 25-token overlap), computes a TF-IDF vector for each chunk, and stores everything in SQLite.

Chunking example:

docs/01-gas-leak-detection.md
  → Chunk 1: "Gas Leak Detection – Safety Warnings: Ensure all ignition..."
  → Chunk 2: "...sources are eliminated. Step-by-step: 1. Perform visual..."
  → Chunk 3: "...inspection of all joints. 2. Check calibration date..."

The overlap ensures no information falls between chunk boundaries — a critical detail in any RAG system.

Step 2: Query → Retrieval

When the user sends a question, the server converts it into a TF-IDF vector, compares it against every stored chunk using cosine similarity, and returns the top-K most relevant results. For 20 documents (~200 chunks), this executes in under 10ms.

src/vectorStore.js:
```js
/** Retrieve top-K most relevant chunks for a query. */
search(query, topK = 5) {
  const queryTf = termFrequency(query);
  const rows = this.db.prepare("SELECT * FROM chunks").all();
  const scored = rows.map((row) => {
    const chunkTf = new Map(JSON.parse(row.tf_json));
    const score = cosineSimilarity(queryTf, chunkTf);
    return { ...row, score };
  });
  scored.sort((a, b) => b.score - a.score);
  return scored.slice(0, topK).filter((r) => r.score > 0);
}
```

Step 3: Prompt Construction

The retrieved chunks are injected into the prompt alongside system instructions:

Prompt structure:

System: You are an offline gas field support agent. Safety-first...
Context:
  [Chunk 1: Gas Leak Detection – Safety Warnings...]
  [Chunk 2: Gas Leak Detection – Step-by-step...]
  [Chunk 3: Purging Procedures – Related safety...]
User: How do I detect a gas leak?

Step 4: Generation + Streaming

The prompt is sent to Foundry Local via the OpenAI-compatible API. The response streams back token-by-token through Server-Sent Events (SSE) to the browser:

- Safety-first response with structured guidance
- Expandable sources with relevance scores

Foundry Local: Your Local AI Runtime

Foundry Local is what makes the "offline" part possible. It's a runtime from Microsoft that runs small language models (SLMs) on CPU or NPU — no GPU required. It exposes an OpenAI-compatible API and manages model downloads, caching, and lifecycle automatically.

The integration code is minimal. If you've used the OpenAI SDK before, this will feel instantly familiar:

src/chatEngine.js:
```js
import { FoundryLocalManager } from "foundry-local-sdk";
import { OpenAI } from "openai";

// Start Foundry Local and load the model
const manager = new FoundryLocalManager();
const modelInfo = await manager.init("phi-3.5-mini");

// Use the standard OpenAI client — pointed at the local endpoint
const client = new OpenAI({
  baseURL: manager.endpoint,
  apiKey: manager.apiKey,
});

// Chat completions work exactly like the cloud API
const stream = await client.chat.completions.create({
  model: modelInfo.id,
  messages: [
    { role: "system", content: "You are a helpful assistant." },
    { role: "user", content: "How do I detect a gas leak?" }
  ],
  stream: true,
});
```

Portability matters: because Foundry Local uses the OpenAI API format, any code you write here can be ported to Azure OpenAI or OpenAI's cloud API with a single config change. You're not locked in.

Why TF-IDF Instead of Embeddings?
Most RAG tutorials use embedding models for retrieval. We chose TF-IDF for this project because:

- Fully offline — no embedding model to download or run
- Zero latency — vectorization is instantaneous (just math on word frequencies)
- Good enough — for a curated collection of 20 domain-specific documents, TF-IDF retrieves the right chunks reliably
- Transparent — you can inspect the vocabulary and weights, unlike neural embeddings

For larger collections (thousands of documents) or when semantic similarity matters more than keyword overlap, you'd swap in an embedding model. But for this use case, TF-IDF keeps the stack simple and dependency-free.

Mobile-Responsive Field UI

Field engineers use this app on phones and tablets, often wearing gloves. The UI is designed for harsh conditions with a dark, high-contrast theme, large touch targets (minimum 48px), and horizontally scrollable quick-action buttons.

Desktop view | Mobile view

The entire frontend is a single index.html file — no React, no build step, no bundler. This keeps the project accessible and easy to deploy anywhere.

Runtime Document Upload

Users can upload new documents without restarting the server. The upload endpoint receives markdown content, chunks it, computes TF-IDF vectors, and inserts the chunks into SQLite — all in memory, immediately available for retrieval.

Drag-and-drop document upload with instant indexing

Adapt This for Your Own Domain

This project is a scenario sample designed to be forked and customized. Here's the three-step process:

1. Replace the Documents

Delete the gas engineering docs in docs/ and add your own .md files with optional YAML front-matter:

docs/my-procedure.md:
```md
---
title: Troubleshooting Widget Errors
category: Support
id: KB-001
---

# Troubleshooting Widget Errors

...your content here...
```

2. Edit the System Prompt

Open src/prompts.js and rewrite the instructions for your domain:

src/prompts.js:
```js
export const SYSTEM_PROMPT = `You are an offline support agent for [YOUR DOMAIN].
Rules:
- Only answer using the retrieved context
- If the answer isn't in the context, say so
- Use structured responses: Summary → Details → Reference
`;
```

3. Tune the Retrieval

Adjust chunking and retrieval parameters in src/config.js:

src/config.js:
```js
export const config = {
  model: "phi-3.5-mini",
  chunkSize: 200,    // smaller = more precise, less context per chunk
  chunkOverlap: 25,  // prevents info from falling between chunks
  topK: 3,           // chunks per query (more = richer context, slower)
};
```

Extending to Multi-Agent Architectures

Once you have a working RAG agent, the natural next step is multi-agent orchestration, where specialized agents collaborate to handle complex workflows.
With Foundry Local's OpenAI-compatible API, you can compose multiple agent roles on the same machine:

Multi-agent concept:
```js
// Each agent is just a different system prompt + RAG scope
const agents = {
  safety:    { prompt: safetyPrompt,    docs: "safety/*.md" },
  diagnosis: { prompt: diagnosisPrompt, docs: "faults/*.md" },
  procedure: { prompt: procedurePrompt, docs: "procedures/*.md" },
};

// Router determines which agent handles the query
function route(query) {
  if (query.match(/safety|warning|hazard/i)) return agents.safety;
  if (query.match(/fault|error|code/i)) return agents.diagnosis;
  return agents.procedure;
}

// Each agent uses the same Foundry Local model endpoint
const response = await client.chat.completions.create({
  model: modelInfo.id,
  messages: [
    { role: "system", content: selectedAgent.prompt },
    { role: "system", content: `Context:\n${retrievedChunks}` },
    { role: "user", content: userQuery }
  ],
  stream: true,
});
```

This pattern lets you build specialized agent pipelines: a triage agent routes to the right specialist, each with its own document scope and system prompt, all running on the same local Foundry instance. For production multi-agent systems, explore Microsoft Foundry for cloud-scale orchestration when connectivity is available.

Local-first, cloud-ready: start with Foundry Local for development and offline scenarios. When your agents need cloud scale, swap to Azure AI Foundry with the same OpenAI-compatible API; your agent code stays the same.

Key Takeaways

1. RAG = Retrieve + Augment + Generate: ground your AI in real documents, dramatically reducing hallucination and making answers traceable.
2. Foundry Local makes local AI accessible: an OpenAI-compatible API running on CPU/NPU. No GPU required. No cloud dependency.
3. TF-IDF + SQLite is viable: for small-to-medium document collections, you don't need a dedicated vector database.
4. Same API, local or cloud: build locally with Foundry Local, deploy with Azure OpenAI, zero code changes.

What's Next?

- Embedding-based retrieval — swap TF-IDF for a local embedding model for better semantic matching
- Conversation memory — persist chat history across sessions
- Multi-agent routing — specialized agents for safety, diagnostics, and procedures
- PWA packaging — make it installable as a standalone app on mobile devices
- Hybrid retrieval — combine keyword search with semantic embeddings for best results

Get the code: clone the repo, swap in your own documents, and start building:

```
git clone https://github.com/leestott/local-rag.git
```

github.com/leestott/local-rag — MIT licensed, contributions welcome. Open source under the MIT License. Built with Foundry Local and Node.js.

The Hidden Architecture of Nano Architectures
Why does the same prompt, on the same checkpoint, with temperature set to zero, sometimes produce a different answer only when the system is under real load? If you have ever watched token three flip and then watched the whole completion diverge, you already know this is not a product bug. It is a systems fact.

Here is the thing. In production, you did not deploy a model. You deployed a runtime that selects an execution plan under constraints. The weights are inside that plan. The behavior is the plan.

I'm Hazem Ali — Microsoft AI MVP, Distinguished AI and ML Engineer and Architect, and Founder and CEO of Skytells. I've built and led engineering work that turns deep learning research into production systems that survive real-world constraints. I speak at major conferences and technical communities, and I regularly deliver deep technical sessions on enterprise AI and agent architectures. If there's one thing you'll notice about me, it's that I'm drawn to the deepest layers of engineering, the parts most teams only discover when systems are under real pressure. My specialization spans the full AI stack, from deep learning and system design to enterprise architecture and security.

A rule I repeat in every serious review is simple. If you cannot explain the runtime, you do not understand the model you deployed. — Hazem Ali

This is the next layer after my earlier deep dive on memory, KV cache, paging, and trust boundaries in The Hidden Memory Architecture of LLMs. I also break down the memory-and-paging failure modes in When Your LLM Trips the MMU. This one goes lower, into the execution that decides which math actually runs.

When I Had to Prove It Live

I still remember the first time I had to make this concrete in front of a room full of engineers. It was during a technical session I gave, and the question came up in the exact form you've probably heard before: why does the same prompt on the same checkpoint, with temperature set to zero, sometimes produce a different answer only under real load?

So I answered it the only way that holds up in a serious engineering room. I didn't frame it as randomness. I framed it as execution. Not because it sounds cleaner, but because it is the only framing that survives scrutiny: under load, the system is not evaluating the same computation.

In production, you don't deploy weights in isolation. You deploy a runtime that selects an execution plan under constraints. Under load, the constraints change at token cadence: microbatch membership shifts, shapes shift, workspace feasibility tightens, and kernels or algorithms that were legal in the calm regime can become infeasible in the pressured regime. The runtime stays correct by contract, but it executes a different plan.

And once the executed plan changes, reduction staging can change. When reduction staging changes, rounding happens at different points. That can move last bits. In decoding, last bits can become different tokens when early logit margins are thin. After the first token flips, divergence is expected because the context is different.

That's what I mean throughout this article when I say: the weights are inside the plan, but the behavior is the plan.

What is Happening in Runtime

Let's start with the part most teams skip: the runtime pipeline from admission to a token. A production LLM server is not a function call. It is a control plane. And under real load, it behaves like one.
It is not asking "what does the model say." It is asking "what can I execute right now without breaking my guarantees."

Right now matters. Not in theory, in milliseconds. Because every decode step is a new scheduling event. The system does not commit to a single plan for the entire completion. It keeps re-evaluating feasibility as state shifts. What can I execute at this moment, with the VRAM I still have, on the hardware state I am currently in, while staying inside isolation boundaries and latency targets. That question is not answered once per request. It is answered repeatedly, at token cadence.

The queue changes. The batch changes. Memory headroom changes. Cache residency changes. Workspace availability changes. The set of legal kernel and algorithm choices changes with them.

And that is the point most people miss. The runtime is not just running your weights. It is continuously selecting an execution plan under constraint. The weights are inside that plan, but behavior lives in the selection.

That selection is layered. Admission shapes the effective request. Scheduling forms the batch for this step. Kernel and algorithm choice binds the math that will actually run. Memory residency and allocation decide what is feasible. Isolation rules decide what sharing is allowed. Each layer contributes to the final plan, and the plan is what you are deploying.

Admission and shaping

Before your prompt ever reaches the model, it gets shaped. Truncation, policy injection, tool schema expansion, routing metadata, tenant tags, prefix reuse decisions, and safety transformations. If you do not know what I mean by effective request, I mean the exact token sequence that the model saw after shaping. That is the only input that matters for reproducibility.

Batching and step level scheduling

Modern servers do not just batch requests. They batch token steps. In a continuous batching system, token step timing feeds back into batching decisions. A slightly slower step changes who joins the next step. Who joins the next step changes shapes. Shapes change kernels. Kernels change numeric pathways.

This is not an opinion. It is why vLLM exists. The PagedAttention paper describes serving as a batching problem where KV cache grows dynamically, wastes memory through fragmentation, and limits batch size. It introduces block level KV management and builds vLLM on top of it as an LLM serving system.

Kernel plan selection and library behavior

Once shapes are known, the runtime selects kernel variants and library algorithms that are feasible for those shapes and the workspace currently available. This is the part people underestimate. The same operator can have multiple valid implementations. The chosen implementation can change when workspace is tight, when shapes change, or when the engine wants to trade latency for throughput.

Memory allocation and residency

KV cache, activations, temporary buffers, workspace, graph memory, and communication buffers compete for VRAM. Under pressure, allocation patterns change. Fragmentation changes. Residency changes. Cache locality changes. All of that changes the system timeline and the feasible plan space.

If you want a one line summary that is accurate in 2026 production inference, it is this. Inference is a scheduling problem plus a memory residency problem, and the model is inside that.
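To make "they batch token steps" tangible, here is a toy simulation of that control loop in Python. Everything in it is an illustrative assumption: the budget, the shape buckets, the pretend pressure threshold. The only point it demonstrates is that batch membership, shape class, and the selected plan are recomputed every single step.

```python
import random
from collections import deque

random.seed(0)

# Toy requests: (request_id, tokens_left_to_generate, context_length)
waiting = deque((i, random.randint(4, 12), random.randint(64, 512)) for i in range(6))
active = []
STEP_BUDGET = 1024  # illustrative per-step budget on the sum of active context lengths

def shape_class(batch):
    """Bucket the microbatch by its longest context; buckets stand in for kernel shape classes."""
    return 256 * ((max(ctx for _, _, ctx in batch) + 255) // 256)

def pick_plan(bucket, headroom_ok):
    """Toy tactic selection: the same bucket maps to a different plan under pressure."""
    return ("fused" if headroom_ok else "fallback") + f"_ctx{bucket}"

step = 0
while waiting or active:
    # Admission at token cadence: fill this step up to the budget
    while waiting and sum(ctx for _, _, ctx in active) + waiting[0][2] <= STEP_BUDGET:
        active.append(waiting.popleft())

    bucket = shape_class(active)
    plan = pick_plan(bucket, headroom_ok=len(active) <= 4)  # pretend pressure above 4 sequences
    print(f"step {step}: batch={[rid for rid, _, _ in active]} bucket={bucket} plan={plan}")

    # One decoded token per active sequence; contexts grow, finished sequences leave the batch
    active = [(rid, left - 1, ctx + 1) for rid, left, ctx in active if left > 1]
    step += 1
```

Run it and watch the printed plan change as sequences finish and new ones join; nothing about the "model" changed, only the schedule did.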
The Scope First

Let me put it very clearly. I am not claiming every deployment is nondeterministic. I am not claiming every kernel variant flips tokens. I am not claiming seeds are useless.

I am making a narrower claim, the kind you can defend in an incident review without hand waving.

Floating point math is not associative. Order matters. When you parallelize, you change the order of operations, and it is therefore valid for parallel results to differ from a sequential evaluation. NVIDIA states this directly in the CUDA C Best Practices Guide.

CUDA also makes a foundational guarantee to the hardware and scheduler, not to your intuition. Thread blocks must be able to execute independently, in any order, in parallel or in series. That freedom is part of the programming model, not an edge case (ref).

Now connect those two facts. If accumulation order changes, the last bits can change even when every operation is correct, because floating point addition is not associative. NVIDIA explicitly calls this out as well.

Then layer in what serving stacks actually do. Production systems intentionally reshape execution through continuous batching and KV memory management. vLLM is a published example of this co-design, where serving throughput is achieved by dynamic batching and memory-aware KV handling.

Finally, bridge the nano to the semantic. When early logit margins are small, tiny numeric deltas can reorder the top candidates, and a single token flip is enough to diverge the entire completion.

Here is the part that should feel a little scary, because it changes what you think you are operating. Under real load, the system is not just slower. It can enter a different execution regime. Batch composition shifts, shapes shift, workspace and residency shift, and the runtime is forced into a different set of legal kernel and algorithm choices. Nothing "breaks." No bug is required. The system is still correct by contract. But your output is now a property of the regime you are in, not the demo you validated.

That means you can pass every determinism test at idle and still ship a system that drifts only when it matters, at p95 and p99, when queues are long and memory headroom is tight. The first time you notice is often a user screenshot, an audit question, or an incident report where two replicas disagree on the same request because the runtime state was not the same.

The equation principals should use in incident reviews

Most teams ship with the demo mental model.

y = f(x, θ)

One prompt in, one checkpoint, one output. If the output changes, someone concludes the weights changed, or "AI is random." That is not how production inference behaves, because production inference is not just a function. It is execution under constraint.

Production behavior is closer to this.

y = Decode( Exec(θ, x; s) )

θ is still the same weights. But the thing you actually shipped is Exec, and Exec is chosen. It is chosen per step, under the current state of the system. The behavior you observe is the behavior of the executed plan, not the abstract weights.

X is not the prompt. X is the effective request.

X is the exact token sequence the model saw after shaping. Truncation, policy injection, tool schema expansion, routing metadata, prefix reuse, safety transforms. All of that can change what the model actually receives. If you cannot reconstruct x, you are not replaying the request. You are replaying an approximation.
Here is the minimum you should log for x, even if you cannot store raw text:

```python
# minimal "x" record: enough to reproduce or prove you cannot
trace_x = {
    "req_id": req_id,
    "raw_prompt_sha256": sha256(raw_prompt),
    "effective_text_sha256": sha256(effective_text),
    "effective_tokens": len(effective_tokens),
    "truncated": truncated,
    "trunc_reason": trunc_reason,        # e.g., "latency_guard", "context_cap"
    "decode_cfg_applied": decode_cfg,    # temperature/top_p/max_tokens, etc.
    "shaping_events": events,            # ["policy_inject:v3", "tool_schema:v2", ...]
}
```

S is not a vibe. S is the execution state that decides the math.

S is what principals should demand in a postmortem, because this is what turns "it drifted" into "this plan executed under this regime." At minimum, s includes:

- per-step batch composition and shape class
- queue delays and scheduling outcomes
- VRAM headroom and workspace availability
- cache pressure signals
- precision path and engine fallbacks
- distributed timeline signals (TP/PP latency, collective stalls)
- isolation posture (what batching is allowed)

Why this matters: in continuous batching, time becomes part of semantics. A few milliseconds of delay changes who gets co-scheduled at the next token step. That changes shapes. Shapes change kernel/algorithm feasibility. Feasibility changes the numeric pathway. When early logit margins are thin, a tiny pathway delta is enough to flip the argmax.

Here is a short, practical "s" record you can emit per decode step:

```python
# per-step "s" record: what plan ran, under what pressure
step_s = {
    "req_id": req_id,
    "step": t,
    "batch_fp": sha256(",".join(sorted(batch_req_ids)))[:12],
    "shape": f"q=1,k={klen},h={heads},d={hidden},tp={tp}",
    "queue_ms": queue_ms,
    "gpu_ms": gpu_ms,
    "vram_free_mb": vram_free_mb,
    "workspace_free_mb": workspace_free_mb,
    "kv_regime": kv_regime,              # "normal" | "pressured" | "paged"
    "precision_path": precision_path,    # "bf16" | "fp16" | "tf32" | "fp32"
    "algo_id": algo_id,                  # backend/engine specific
    "kernel_variant": kernel_variant,    # if available
    "isolation_mode": isolation_mode,    # "shared" | "strict"
}
```

The incident-review translation

If you only ask "what prompt did the user send" and "what weights did we run," you are using the demo equation. You will argue about seeds, debate "randomness," and never converge.

The production equation forces the real question. Which plan executed, under which constraints, and what state pushed us into that plan.

The line principals should repeat until teams internalize it is simple. Weights are static. Behavior is a property of the executed plan. And the executed plan depends on state.

If you want one more operational layer that makes this feel real, add a regime marker. Regime changes are where "stability" collapses without any bug:

```python
def regime(vram_free_mb, paging_on, isolation_strict, queue_p95_ms):
    if isolation_strict:
        return "isolation_strict"
    if paging_on:
        return "paging"
    if vram_free_mb < 1024:
        return "memory_pressured"
    if queue_p95_ms > 50:
        return "queue_degraded"
    return "normal"
```

When the regime changes, the feasible plan space changes. When the plan space changes, the executed math can change. That is the production reality your incident review must be able to explain.
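And a quick usage sketch for how that marker earns its keep: tag every per-step record with the regime, so drift can be grouped by regime instead of argued about in the abstract. This uses the regime() helper sketched above; the field values are made up.

```python
step_s = {"req_id": "r-42", "step": 17, "vram_free_mb": 812, "queue_ms": 9.3}

# Attach the regime marker so every drift report carries its execution regime
step_s["regime"] = regime(step_s["vram_free_mb"], paging_on=False,
                          isolation_strict=False, queue_p95_ms=step_s["queue_ms"])
# -> "memory_pressured": 812 MB free is below the illustrative 1024 MB threshold
```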
Floating point order is where small deltas are born

Let's break it down without hand waving.

Finite precision makes rounding part of the computation

Floating point math is not real-number math. Every add and multiply is followed by rounding to the representable format you are using. That rounding is not "noise." It is part of the computation.

Once you accept that, one consequence becomes unavoidable. Order matters. NVIDIA states the rule clearly: floating point involves rounding, and when you parallelize you can change operation order, so parallel results may not match sequential results.

Why LLM inference is a perfect storm: reductions everywhere

Now connect that to what an LLM does at inference time. LLM inference is reduction-heavy by design. Dot products in GEMMs, attention score accumulation, softmax normalization, layer norm statistics, even top-k selection pathways. These are not single operations. They are many partial operations combined into a final scalar or vector. In floating point, the way you combine partials is the outcome.

GPU reductions are staged: partial sums, then merges

A reduction on GPU is not "a sum." It is a staged reduction of partials. On a CPU, you can imagine a left-to-right accumulation:

((((a1 + a2) + a3) + a4) + ...)

On a GPU, that mental model is wrong. The GPU is built to run thousands of threads. So it computes partial sums in parallel and then merges them in stages. The staging pattern is determined by kernel design and how the backend maps the problem to hardware.

The staging depends on decisions you do not control at the prompt layer:

- how data is tiled into blocks
- how each block maps to warps
- how many partials each warp reduces
- whether it uses warp-level primitives, shared memory, or tensor core fragments
- how the final merge is staged across blocks

Change the tile size, or the block shape, or the occupancy, and you often change the staging order. Change the staging order, and you change when rounding happens. You can get two results that are both correct under IEEE floating point rules, and they differ in the last bits. This is not a bug. It is the contract of finite-precision parallel math, applied at scale.

Why the last bits move at the core level

Floating point addition is not associative under rounding because rounding happens after each operation. The error introduced at each step depends on the magnitude and sign of what you are adding at that step. When you change the staging order, you change:

- which numbers get added together early
- which partial sums get rounded early
- how cancellation behaves when positive and negative terms interact
- when large and small magnitudes meet, where small values can lose representable impact

That is the core mechanism behind "small deltas." It is not mystical. It is mechanical.

Why this shows up in production serving, not in your demo

LLM inference is dominated by massive matrix operations and attention. Under the hood, those paths accumulate across large dimensions. An accumulation is exactly where rounding order matters most.

And the server does not always run the same kernel variant for those ops. Under load, shape shifts and workspace pressure can push the backend into different implementations. Different implementations often imply different tiling. Different tiling implies different staging. Different staging implies different rounding. Different rounding implies different last bits.
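You can see the mechanism on a CPU in a few lines, no GPU required. A minimal Python demonstration, assuming nothing beyond the standard library: the same values summed sequentially versus in chunked partial sums (a crude stand-in for block-wise GPU staging) land on different last bits.

```python
import random

random.seed(0)
# Values with a wide spread of magnitudes, the regime where rounding order matters most
xs = [random.uniform(-1.0, 1.0) * 10.0 ** random.randint(-8, 8) for _ in range(100_000)]

# Sequential left-to-right accumulation, the "CPU mental model"
seq = 0.0
for v in xs:
    seq += v

# Chunked partial sums merged at the end, a crude stand-in for staged GPU reductions
chunk = 256
partials = [sum(xs[i:i + chunk]) for i in range(0, len(xs), chunk)]
staged = sum(partials)

print(seq == staged)   # typically False
print(seq - staged)    # a tiny, perfectly legal difference in the last bits
```

Both results are correct under IEEE rules; they simply rounded at different points, which is exactly what a different staging tree does on the GPU at much larger scale.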
So even with an identical prompt, identical checkpoint, and temperature set to zero, you can still see tiny numeric differences when:

- batch composition changes and produces different effective shapes
- the engine picks a different algorithm because workspace is tighter
- the kernel selects a different tile path due to shape class and occupancy
- the GPU is in a different pressure regime, changing feasibility and scheduling behavior

Those deltas are small, but they are real. And in decoding, small can be enough.

The bridge from ulps to language: logits, argmax, divergence

A tiny last-bit difference is often irrelevant, until it hits a decision boundary. At decode step t, greedy decoding chooses an argmax. If the top logits are close, a small delta can swap the ordering. Once token t changes, the context changes, and the completion diverges. That is not randomness. That is deterministic branching from a slightly different numerical pathway.

So the actionable takeaway is not "GPUs are nondeterministic." It is this. Parallel math is allowed to produce multiple correct last-bit outcomes, and LLM decoding can amplify those outcomes into different text when margins are thin.

CUDA scheduling makes ordering a form of runtime state

CUDA makes a stronger statement than most people realize. Thread blocks must be able to run independently. It must be possible to execute blocks in any order, in parallel or in series. That is why the same kernel can execute with different inter block ordering depending on occupancy, contention, and scheduling.

Now bring atomics into the picture. Atomics guarantee correctness of each update. They do not guarantee the arrival order of updates across threads and blocks. When floating point updates arrive in different legal orders, the final sum can differ in the last bits, because floating point addition is not associative.

If you do not know what atomic add means, here is the useful definition. Atomic add ensures updates do not overwrite each other. It does not ensure which thread gets there first.

This is the nano architecture layer that explains a lot of weirdness. Many engineers assume determinism is a property of weights. In practice, determinism is constrained by the legal reorderings of parallel execution.

Logit margin is the bridge from ulps to language

Now we connect the last bits to a changed sentence. At decode step t, greedy decoding picks the argmax over logits. Let the top two logits be ℓₐ and ℓ_b. Define the margin:

mₜ = ℓₐ − ℓ_b

A token flip happens when a small perturbation changes the ordering of these top two. If you want an operational translation, it is this. If the model barely prefers token A over token B, a tiny numeric delta can make it prefer B. Once token t changes, the rest of the completion evolves under a different context. Divergence is expected.

This is why I keep pushing one instrumentation idea that sounds boring until you need it. Measure early step margins. You cannot manage stability if you never measure how close the decision boundary is.
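As a sketch of what that instrumentation can look like, here is a minimal margin probe in Python over one step's logits. It assumes only NumPy; the logit values are made up, and where you pull real logits from depends entirely on your serving stack.

```python
import numpy as np

def top2_margin(logits: np.ndarray) -> tuple[int, int, float]:
    """Return (top token id, runner-up id, margin) for one decode step."""
    top2 = np.argpartition(logits, -2)[-2:]          # ids of the two largest logits, unordered
    a, b = top2[np.argsort(logits[top2])[::-1]]      # a = argmax, b = runner-up
    return int(a), int(b), float(logits[a] - logits[b])

# Made-up logits for one early decode step
logits = np.array([2.31, 7.402, 1.05, 7.398, 0.2])
tok, runner_up, margin = top2_margin(logits)
print(tok, runner_up, round(margin, 3))   # 1 3 0.004 -> a razor-thin decision

# Operational rule of thumb: flag steps that decide by less than some epsilon
THIN = 1e-2
if margin < THIN:
    print("thin-margin step: this output is fragile to numeric pathway changes")
```

The epsilon is a knob you tune per model and workload; the point is only that fragility becomes measurable instead of anecdotal.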
The effective request problem, the quiet killer of reproducibility

Here is the pattern I see in almost every serious production investigation. The team replays the user prompt, cannot reproduce the output, and concludes the model is nondeterministic. Then the incident dies in ambiguity. And then, usually too late, someone asks the only question that matters. What did the model actually see.

"In every postmortem, I ask one question before I look at weights, kernels, or seeds: what did the model actually see. If we cannot answer that, nothing else is evidence." — Hazem Ali

In production, the user prompt is not the input. It is an ingredient. By the time a request reaches the model, it has passed through a shaping pipeline that exists to keep the system safe, fast, and multi-tenant. That pipeline is not cosmetic. It can change semantics, length, and even decode behavior. The result is the only input that matters for reproducibility. The effective request.

This is the same thesis you have already accepted earlier in the article.

y = Decode( Exec(θ, x; s) )

If you do not know x, your replay is not valid. If you do not know s, your replay is not comparable. And if you only log the raw prompt, you are logging neither.

Shaping changes semantics, not just length

Truncation is the obvious one. Under load, systems often cap context length to protect latency and GPU memory. Same prompt, different truncation boundary, different effective context, different output. Nothing "random" happened. You executed a different input.

But truncation is only the beginning. Policy injection can prepend or append system text that changes intent. Tool schema expansion can add hundreds or thousands of tokens and push the request over a context boundary. Routing metadata can select a different template. Prefix caching can reconstruct parts of context from cached state rather than raw text. Safety transformations can rewrite or neutralize content. Even small differences here can shift early logits when margins are thin, and this article already showed how small deltas become different tokens.

The worst part is that this is silent by default. The user sees their prompt. Engineers see the prompt in logs. The model sees a different token sequence. Then everyone argues about reproducibility using the wrong input.

Why this interacts with load, not just correctness

Under low load, your system often has enough headroom to be generous. Longer context, fewer cutoffs, stable routing, more consistent batching, and fewer fallbacks. Under real load, shaping becomes defensive. Dynamic truncation thresholds kick in. Tool schema expansions collide with context limits. Prefix reuse behavior changes. Safety gates can become stricter. The same user text can produce a different effective request, and therefore a different output, precisely when the system is under pressure.

So if you are only validating reproducibility at idle, you are validating a different system than the one you ship.

What principals should require in telemetry

If you want strict reproducibility, you must log the execution contract per request. Not the story. The contract. At minimum:

- effective token count after shaping
- truncation boundary and reason
- final merged decode config actually applied
- policy gates that modified prompt or decode path
- whether prefix cache was used, and what cache key was referenced
- routing template version and system message hash

If you are privacy constrained, you can still log hashes and structural facts. You do not need raw prompts to diagnose effective request drift. You need verifiable fingerprints.

Here is the short version in one line. If you only log the user prompt, you have not logged x. You have logged an approximation of x. And without x, you cannot claim reproducibility. You can only hope for it.

Continuous batching, why time becomes part of semantics

This is where principal level thinking matters. Continuous batching does not just increase throughput. It changes the execution context at each token step. Batch composition changes shapes.
Shapes influence kernel selection and workspace feasibility. Those choices can change reduction structure and rounding pathways.

If you want a published anchor, use vLLM. The PagedAttention paper frames high throughput serving as a need to batch many requests, but KV cache grows dynamically and wastes memory through fragmentation. It proposes PagedAttention and builds vLLM on top of it, with block level memory management and flexible sharing of KV cache to reduce memory usage. (arXiv)

Here is what this really means in production. The server is selecting which requests share a step. That changes the math shapes. That changes the executed plan. That is why the same prompt behaves differently under load even at temperature zero.

Algorithm selection and engine fallback, the hidden variability people forget about

If you have ever tried to reproduce a drift across replicas and felt like you were chasing ghosts, this is usually the layer you were missing.

Libraries and engines choose. Not in a philosophical sense. In a literal, per-operator, per-shape sense. The same attention call is a fork in the road between multiple legal tactics, each with different tiling, different reduction staging, different fusion boundaries, and different temporary memory requirements. Your checkpoint is the same, your prompt is the same, your temperature is zero, and the output still moves because the executed plan moved.

PyTorch says the quiet part directly. Disabling cuDNN benchmarking makes cuDNN deterministically select an algorithm, and PyTorch stresses this is different from the deterministic setting. That is the whole story in one sentence: one switch affects how the backend selects an algorithm, another affects whether the selected algorithms are deterministic. Those are separate layers, and under load they can diverge.

Now go down to the core of the core. A tactic is not fast or slow. In production serving, a tactic is legal or illegal under the constraints of this token step. The constraint that forces most plan switches is not compute. It is workspace feasibility. Many high-performance kernels need scratch buffers. Some need enough contiguous space to stage tiles, reorder operands, hold partials, or run fused epilogues. When VRAM is fragmented or headroom drops, a tactic becomes impossible even if it is the tactic you validated at idle. The engine does not throw a warning. It simply selects another legal tactic. That is the first uncomfortable point.

The second uncomfortable point is what makes this align perfectly with the next section. The constraint is not only "how many MB are free." The constraint is the memory hierarchy state of the chip. Under load, two replicas can have the same free VRAM and still be in a different regime because the chip is not one pool of memory. It is HBM plus an on-die L2, plus TLBs, plus page tables, plus a fabric that is arbitrating traffic between SMs, L2 slices, and HBM controllers. When that hierarchy shifts, latency per token step shifts. And in continuous batching, a few milliseconds is not a timing detail, it is a scheduling input.

This is how a performance event becomes a behavior event without any bug. The engine's planner sees a world where a tactic that was "best" at idle is no longer best, or no longer feasible, because the chip is in a different pressure state. Your runtime is still correct. It is just operating a different plan in a different regime.

One op, multiple legal kernels. The chosen tactic depends on shape class and feasibility.
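The PyTorch point above trips people up in config reviews, so here is a minimal sketch of the two separate layers using standard PyTorch switches. This is about the distinction, not a recommended production setting, and serving engines layer their own knobs on top of these.

```python
import torch

# Layer 1: how the backend selects an algorithm.
# benchmark=True lets cuDNN time candidate algorithms and pick per shape,
# which can change the selected algorithm across runs and across shape classes.
torch.backends.cudnn.benchmark = False   # deterministic *selection* of an algorithm

# Layer 2: whether the selected algorithms are themselves deterministic.
# A different guarantee; it can raise errors for ops with no deterministic path.
torch.use_deterministic_algorithms(True)

# The two switches answer different questions:
#   benchmark      -> "which algorithm gets picked, and is that choice stable?"
#   deterministic  -> "given an algorithm, is its output bit-stable run to run?"
# A serving stack can hold the second and still drift on the first
# if shapes or workspace feasibility change which tactic is legal.
```

That is exactly why "we set the deterministic flag" does not close an incident on its own: the flag constrains the algorithms, not the selection pressure that decides which algorithm is even feasible at this step.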
Now bring TensorRT into the picture, because it makes the precision dimension explicit. TensorRT states TF32 Tensor Core usage is not guaranteed and it can fall back to FP32, and it documents configuration controls around TF32. That statement is not about "precision preference." It is about the reality that precision is part of tactic selection. Precision changes which instructions execute and how accumulation is staged. When your early logit margins are thin, a small pathway delta can swap the argmax at one step. One token flips, and the rest of the completion deterministically diverges.

So "temperature zero" is not a determinism guarantee. Temperature governs sampling. It does not pin the execution pathway.

If you want a more mechanical anchor, treat matmul the way NVIDIA exposes it: cuBLASLt has a preference descriptor for applying algorithm search preferences and fine-tuning the heuristic function. That is not marketing. That is the API admitting that algorithm selection is a constrained search problem.

Now the part that gets rare, and the part most teams never write down. CUDA's programming model requires that thread blocks be able to execute independently and may execute in any order, in parallel or in series. This matters here because tactic switches often change block geometry and tiling. Different block geometry changes reduction staging. Reduction staging changes where rounding happens. Even if every operation is correct, last bits can move because you legally changed the staging of partial sums. You do not need randomness. You need a different legal staging tree.

Now pull security into the same frame, because it is not a separate layer in production. Security posture changes what the scheduler is allowed to do. Isolation constraints reduce batching freedom. Reduced batching freedom increases tail latency. Tail latency pushes you toward tighter admission controls and more aggressive memory behavior. That shrinks the feasible tactic set sooner. In other words, security decisions can move you across regime boundaries faster, which increases plan switching frequency. Stability becomes an SLO dimension of your security posture, not a property of your weights. This is the business consequence that shows up in the worst possible way.

So here is the operational rule I use in reviews. If you cannot prove which plan ran, you cannot claim reproducibility. And that leads to the only practical addition that belongs in this section before we move into VRAM bandwidth and cache residency: record the plan the engine actually executed, per step, the way the step_s sketch above does with its algo_id and kernel_variant fields.

VRAM bandwidth, cache residency, and why memory hierarchy becomes control plane input

Let's talk about the performance facts that quietly become behavior facts.

And yes, I know how complex this gets. I have watched strong staff and principal engineers get lost here, not because they are weak, but because the system crosses too many layers at once: GPU microarchitecture, allocator behavior, kernel tactics, batching policy, and SLO-driven control loops. No single dashboard shows you the full causal chain. That is exactly why I frame it this way. It is not "performance tuning." It is a coupled control system.

So let me break it down cleanly, from the chip outward, until the behavior change becomes inevitable. NVIDIA describes H100 SXM5 as having HBM3 bandwidth around 3 TB/s and an L2 cache of 50 MB designed to reduce trips to HBM by caching repeated accesses.
Most teams read that as "the GPU is fast." In serving, it is more precise to say: the GPU gives you a memory hierarchy with regimes, and your runtime is forced to adapt to whichever regime you are currently in.

The chip-level model you should carry in your head

Decode is not one big matmul. It is a loop that repeatedly touches a shifting working set:

- KV blocks for the active sequences
- attention metadata (block tables, indirection, masks)
- sampling buffers (logits, top-k/top-p structures)
- runtime bookkeeping for continuous batching

Those accesses are not purely streaming. They are pointer-heavy, and their locality depends on how your KV is laid out, which requests are co-scheduled, and how fragmented your memory becomes under churn.

Here is the simplest mental model that is still honest:

t_mem ≈ B_HBM / BW_HBM + t_translate

- B_HBM is the number of bytes actually read from HBM during this step.
- B_L2miss is the number of bytes that missed L2 and therefore had to be fetched from HBM; it is what drives B_HBM up.
- BW_HBM is the HBM bandwidth you actually achieve under contention.
- t_translate is the address-translation tax: extra time from TLB misses and page-table walks.

That last term is the one that surprises people. It's "invisible" until it dominates.

Why L2 residency becomes a control-plane input

Now connect that to decode. Decode repeatedly reads KV state. If L2 hit rate drops, HBM traffic rises. When HBM traffic rises, stalls rise. When stalls rise, token-step latency shifts. When token-step latency shifts, the server changes batching decisions.

This is the control loop you should keep in your head:

L2 hit rate ↓ → t_step ↑ → Δt ↑ → batch composition changes → shape class changes → tactic set changes

That is the bridge from "cache miss" to "different plan executed." In continuous batching, time is not just an output metric. Time is an input into the next scheduling decision. A few milliseconds can change who gets co-scheduled at the next token step. That changes shapes. Shapes change feasible kernels and algorithms. That changes the executed math. And if early logit margins are thin, a small pathway delta can flip a token and send the rest of the completion down a different branch.

Rare but matters: the translation tax that breaks the "free VRAM" illusion

Two replicas can report similar free VRAM and still be in different regimes. Why? Because the chip is not "a pool of memory." It is an on-die cache, translation structures, page tables, and a fabric that is arbitrating traffic under pressure.

When KV is stored in blocks (or pages) and those blocks are scattered due to churn, you often get:

- worse spatial locality
- more distinct memory regions per step
- more TLB pressure
- more page walks

Page walks are not abstract. They are memory reads. They compete with your payload reads. Under real load, this turns into self-inflicted HBM traffic. So you can be "bandwidth rich" on paper and still be "latency poor" in practice because the working set became translation-hostile. This is how a performance event becomes a behavior event without any bug.

A concrete KV bandwidth sanity check

If you want a back-of-the-envelope check for why decode becomes memory-shaped, use a conservative estimate. Per token step, you often need to read a large portion of KV for the active context. A rough model is:

KV bytes per step ≈ 2 × B × L × H × D × s

Where:

- B is batch size (number of sequences co-scheduled in the step)
- L is current context length (tokens already in KV)
- H is the number of attention heads (or KV heads, depending on the model)
- D is head dimension
- s is bytes per element (2 for fp16/bf16, 1 for int8, etc.)

The factor 2 accounts for K and V.
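Plugging in numbers makes the point quickly. Here is a tiny Python helper for that estimate; the example configuration is made up for illustration, not taken from any particular model.

```python
def kv_bytes_per_step(batch, ctx_len, kv_heads, head_dim, bytes_per_elem=2):
    """Rough KV read volume per decode step; the factor 2 covers K and V."""
    return 2 * batch * ctx_len * kv_heads * head_dim * bytes_per_elem

# Illustrative: 32 co-scheduled sequences, 4k context, 8 KV heads, head_dim 128, bf16
gb = kv_bytes_per_step(batch=32, ctx_len=4096, kv_heads=8, head_dim=128) / 1e9
print(f"{gb:.2f} GB of KV traffic per token step")   # ~0.54 GB, every single step
```

Half a gigabyte of reads per token step is the kind of number that explains, on its own, why L2 residency and fragmentation are not tuning footnotes but step-time drivers.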
Even if your kernel is compute-efficient, you are still moving a lot of bytes. If locality collapses and L2 misses rise, you shift into an HBM-limited regime fast. That is the mechanical reason your p95/p99 step time moves under load, even with the same checkpoint and temperature.

Business impact, stated plainly

This is why drift shows up where it hurts: p95 and p99. At idle, L2 residency is generous, fragmentation is lower, translation pressure is calmer, and step time is stable. Under load, residency collapses, translation tax rises, allocator feasibility tightens, step time stretches, and your control plane adapts by changing batching and shapes. That can move you into different execution plans without any model change.

An enterprise buyer does not care whether you call it "L2 miss driven plan churn." They care that two identical requests disagree and you cannot explain it.

So the takeaway I want principals to internalize is simple: in continuous batching, memory hierarchy state is control-plane state. It shapes latency. Latency shapes batching. Batching shapes shapes. Shapes shape feasibility. Feasibility shapes the executed plan. That is how "performance" becomes "behavior."

Multi node tensor parallel, the execution plan extends across the fabric

Once you go multi-node tensor parallel, you add a second execution plane that most teams underestimate. You are no longer operating only a GPU runtime. You are operating a distributed timeline. And the timeline is not a background detail. In continuous batching, the timeline becomes a control input that reshapes batching, shapes, and eventually the executed plan.

Let me be precise about what I am claiming, and what I am not. I am not going to claim collectives reorder arithmetic inside a kernel. That would be sloppy. The correct claim is this: distributed synchronization changes the timeline. The timeline changes admission and batching. Batching changes shapes. Shapes change which plans are legal.

That's enough to explain why the "same prompt, same checkpoint, temp=0" can drift only under real load.

The minimal equation you should carry

At each decode step, your latency is no longer "GPU time." It's GPU time plus fabric time:

t_step ≈ t_compute + t_comm + t_sync

And the part that hurts is that t_comm and t_sync are not stable. They are affected by contention, queueing, stragglers, and topology. A useful mental model for the communication piece is the classic latency–bandwidth form:

t_comm(message) ≈ α + (n / β_eff)

- α is the per-collective startup and synchronization overhead
- n is bytes moved
- β_eff is the effective bandwidth you actually get under contention

In isolation, this looks like performance math. In a continuous batching server, this becomes behavior math, because t_step feeds back into the next scheduling decision.

What actually happens in multi-node TP at token cadence

Tensor parallelism shards the model across devices. Every token step requires cross-device coordination for some portion of the layer execution. In practice, this means collectives become part of the critical path. NCCL's collective ops are explicit about the semantics: for example, AllReduce reduces values across ranks and returns identical results to all ranks. That tells you what the runtime must do: it must wait for coordination across ranks before progressing.
So the decode loop becomes:

1. execute local compute for this step
2. hit a collective boundary
3. wait for the slowest rank to finish and for the fabric to deliver
4. proceed

That "slowest rank" detail is the piece people feel but rarely name. In distributed inference, p99 is often a straggler story. A single congested link, a slightly delayed rank, or a transient fabric stall turns into a global stall because collectives synchronize progress. In other words, a multi-node TP system behaves like a coupled oscillator: the fastest GPU is still gated by the slowest collective.

Why this changes the executed plan, not just the latency

Here's the bridge to the thesis of the whole article. In a continuous batching server, you do not just execute requests. You continuously reform microbatches at token cadence. That means step time affects who joins the next step. And in multi-node TP, fabric jitter is one of the biggest sources of step-time variability.

So when comm jitter shifts t_step, it shifts the schedule:

- queue delay changes
- microbatch membership changes
- effective shape class changes
- workspace feasibility changes
- tactic choice changes

You already established earlier that a changed shape class can force a different tactic set. Multi-node TP adds a new reason shape churn happens: not only GPU pressure, but fabric timing pressure. So the claim stays clean and defensible: distributed synchronization doesn't need to change arithmetic to change behavior. It only needs to change the timeline that drives batching.

Chip-to-fabric reality: why infrastructure details belong in the reproducibility record

At this scale, the infrastructure is part of the runtime. According to Azure Docs, Azure's ND H100 v5 series is explicitly positioned for tightly coupled scale-up and scale-out Generative AI and HPC workloads, and it's built around the idea that the fabric matters, not just the GPUs.

If you are running multi-node TP in production, treat fabric telemetry as part of your reproducibility record. Not because it is fun. Because it changes the system timeline that drives batching. A practical minimum is to track per-step:

- collective type on the critical path (e.g., all-reduce / all-gather)
- comm time and jitter (p50/p95/p99 per step window)
- rank skew (max(rank_time) − min(rank_time))
- effective bandwidth estimate (n / t_comm)
- retransmit / congestion signals if your stack exposes them
- a "fabric regime" marker: normal vs congested vs degraded

When drift becomes expensive

This is one of the reasons enterprise teams report the most confusing failures only at load. At idle, your timeline is stable, your microbatches are stable, your shapes are stable, and your plan selection is stable. Under real load, the fabric introduces jitter, jitter reshapes batching, batching reshapes shapes, and shapes reshape the executed plan.

Now two replicas can disagree, not because the model changed, but because the timeline differed. That shows up as:

- inconsistent answers across replicas in high-stakes workflows
- reproducibility failures during audits and incident reviews
- "regressions" after scaling out, even with the same checkpoint and code
- support costs and credibility loss because you cannot explain why behavior changed only at p95/p99

So the operational sentence I want you to carry into your postmortems is: in multi-node tensor parallel inference, the execution plan extends across the fabric. If you do not log the fabric timeline, you are missing part of the runtime state that decides which plan was feasible.
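In the same spirit as the step_s sketch earlier, here is a minimal per-step fabric record as a Python helper. The field names and the congestion threshold are illustrative assumptions, not the schema of any particular stack; NCCL does not hand you this dict, you assemble it from whatever your profiler and fabric counters expose.

```python
def fabric_step_record(step, collective, rank_ms, bytes_moved, retransmits=0):
    """Summarize the fabric side of one token step from per-rank timings (ms) and bytes moved."""
    comm_ms = max(rank_ms)                   # the collective finishes with the slowest rank
    skew_ms = max(rank_ms) - min(rank_ms)    # straggler signal
    eff_gbps = (bytes_moved * 8 / 1e9) / (comm_ms / 1e3) if comm_ms else 0.0
    return {
        "step": step,
        "collective": collective,            # e.g. "all_reduce", "all_gather"
        "comm_ms": round(comm_ms, 3),
        "rank_skew_ms": round(skew_ms, 3),
        "eff_bandwidth_gbps": round(eff_gbps, 1),
        "retransmits": retransmits,
        "fabric_regime": "congested" if skew_ms > 0.5 * comm_ms else "normal",  # toy threshold
    }

# Illustrative numbers: rank 3 is the straggler on this step
print(fabric_step_record(step=812, collective="all_reduce",
                         rank_ms=[1.9, 2.0, 2.1, 4.7], bytes_moved=64 * 2**20))
```

The exact fields matter less than the habit: every drift report should be able to say whether the fabric was in its calm regime or its congested one when the step ran.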
Where Infrastructure Stops Being "Just Infrastructure"

Once you accept the thesis of this article, one conclusion becomes unavoidable: cloud choices are not just cost and convenience decisions. They shape which execution regimes your runtime will enter under pressure.

At scale, you are no longer buying "GPUs." You are buying:

- A fabric and topology that holds up under synchronized token-step collectives
- A VM family with predictable characteristics for tightly coupled scale-out workloads (the kind multi-node inference actually is)
- An isolation posture that can be enforced in hardware when your threat model requires it, without hand-waving away the runtime implications
- First-class observability for GPU behavior, not just CPU and request traces, so you can correlate drift with the state variables that caused it (for example, exporting NVIDIA DCGM metrics into managed Prometheus and Azure Managed Grafana on AKS)

This is the quiet reason certain platforms feel "more stable" in production. Not because the model is different, but because the runtime state is easier to constrain, measure, and explain when the underlying infrastructure is designed for the exact regime you're operating in.

Quantization effects on execution paths and causal stragglers in multi-node TP

Let me be direct about what most articles miss when they discuss distributed inference at scale. The conversation typically stops at "how many GPUs" and "what's the bandwidth." That's not wrong. It's just incomplete. What's missing is the interaction between quantization-induced plan churn and straggler amplification in the collective path, two forces that quietly reshape your execution regime under VRAM pressure and fabric contention.

These are not theoretical curiosities. They are production realities at 100+ GPU scale, the kind of scale where you can no longer afford to treat quantization as a "precision choice" or stragglers as a "latency outlier." At that scale, they become causal inputs to your runtime's decision surface.

Quantization variability: not just precision, but plan selection

When teams talk about INT8 or FP8 quantization, the conversation usually centers on memory savings and throughput gains. That's the marketing layer. The execution layer is more nuanced: quantization changes which kernels are legal, where fusion boundaries land, and how reduction trees are staged.

Here's what I mean in concrete terms. Under VRAM pressure, your serving stack may need to requantize activations mid-forward-pass to stay within memory bounds. That requant step is not "free" in the plan sense. It introduces:

- dequant/requant cycles that break fusion opportunities you had in the FP16 path
- new non-associative operations in the reduction tree, where rounding happens at different stages
- fallback paths when the quantized kernel variant lacks workspace or doesn't support the current shape class

Let me state this in the language of the article's thesis: quantization is not a data type. It is a tactic constraint that reshapes the feasible plan space. Memory pressure can force dequant/requant cycles, change fusion boundaries, and trigger fallback kernels with different reduction staging, producing last-bit differences that can flip tokens during decoding.

The practical consequence? Two replicas running "the same quantized model" can execute different kernel variants when one is memory-pressured and the other is not. The memory-pressured replica may be forced into a fallback path with different reduction staging.
Different staging means different rounding order. Different rounding order means different last bits. And in decoding, last bits can become different tokens. I've watched incident reviews where teams assumed INT8 was "deterministic" because they set the quantization scheme once at export time. What they missed is that the runtime's quantization pathway depends on the state of VRAM fragmentation, workspace availability, and kernel preference histograms, exactly the regime-dependent variables we've been building toward throughout this article. If you're operating at scale, instrument this. Track: per-step kernel selection via cuBLASLt preference descriptors dequant/requant cycle counts when memory pressure rises fallback events when preferred quantized tactics become infeasible whether the executed plan matched the "expected" quantization pathway This is rare telemetry. Most teams never see it because they're not running large enough clusters under sustained pressure. But once you cross into 100+ GPU inference workloads, quantization-induced plan churn becomes visible in your p99 drift signatures. Causal stragglers: when one rank's fallback stalls the collective Now let's talk about the fabric-scale pathology that couples with everything we just discussed: head-of-line blocking in distributed tensor parallelism. You already know from the multi-node TP section that collectives synchronize progress. The fastest rank waits for the slowest. That's the contract. What's less documented—and what I've only seen formalized in internal NVIDIA serving postmortem templates—is how a single rank's kernel fallback can become a collective-wide straggler, and how that straggler amplifies through the batching feedback loop. Here's the causal chain: One rank enters memory pressure. Maybe fragmentation is worse on that device, maybe it's handling a slightly different KV layout due to request assignment. That rank falls back to a slower tactic. The preferred kernel requires workspace. Workspace isn't available. The engine selects a legal fallback. The fallback kernel takes longer. Not by seconds—by milliseconds. But in a collective, milliseconds matter. The collective waits. AllReduce can't proceed until all ranks contribute. The straggler becomes the bottleneck. Step time stretches. The stretched step reshapes the next batch in continuous batching. Different batch, different shapes, different feasibility. The cycle repeats. Now multiple ranks may be in fallback paths. The p99 drift you're seeing isn't random—it's a feedback loop. This is what I call a causal straggler: not just a slow rank, but a rank whose performance degradation causally reshapes the execution regime of the entire TP group. And here's where quantization and stragglers intersect. If one rank is under more VRAM pressure and is forced into more frequent dequant/requant cycles, it becomes the straggler. Its quantization pathway differs from the other ranks—not because the model changed, but because the memory regime changed. That difference in pathway becomes a difference in step time. That difference in step time becomes a collective stall. That stall becomes a batching change. That batching change becomes a new plan. The output drifts, and you're left wondering why "the same checkpoint at temperature zero" produced different text only under load. The answer is: you weren't in the same execution regime. 
You were in a regime where one rank's memory pressure caused a straggler, the straggler caused a timeline shift, and the timeline shift caused a plan change. Rarity value: why this knowledge is elite production battle scars Let me be honest about why these gaps are rare. Most teams never operate at the scale where these effects dominate. If you're running inference on 8 GPUs, you might see hints of this. At 100+ GPUs with multi-node TP and continuous batching under sustained load, it's no longer a hint—it's the signature. The teams that do operate at this scale track: cuBLASLt preference histograms to detect when algorithm selection is churning across steps NCCL timeline traces to identify straggler signatures and correlate them with per-rank memory state per-rank kernel fallback events to see when one device is operating a different plan than its peers quantization pathway divergence across ranks under pressure This is the telemetry that doesn't show up in tutorials. It shows up in postmortems at hyperscaler SLO thresholds, where p99 latency violations trigger incident reviews and someone finally asks: "Why did replica 3 disagree with replica 1 only during the peak load window?" The article you're reading now covers single-node memory regimes beautifully. What bridges to 10/10 elite production knowledge is this: fabric-scale causality. The understanding that in multi-node TP, your execution regime is not just shaped by your GPU's memory state—it's shaped by the worst GPU's memory state, because collectives couple everyone's timeline. That's the gap. That's the rarity value. And if you're building or operating inference at 100+ GPU scale, that's the layer where your next outage is hiding. Peak depth: wavefront divergence, tensor core fragmentation, NCCL backpressure, and ISR collision Everything above operates at the principal and staff engineer level. What follows is the layer below that—the chip architect handoff, where you stop talking about "plans" in the abstract and start talking about warp stall cycles, tensor core fragment occupancy, NCCL retransmit chains, and memory evaporation under replication pressure. I'm writing this section because it's the part I never see published outside internal design reviews, and because these are the exact pathologies that turn a well-architected inference cluster into a system that disagrees with itself only during peak traffic. "Most engineers debug the layer they understand. The system breaks at the layer they don't. In production inference, that layer is almost always the one where microarchitecture meets scheduling meets the fabric." — Hazem Ali Wavefront divergence in decode attention kernels Let me take you inside the warp. In SIMT execution, a warp is 32 threads executing in lockstep. When all threads follow the same control path, you get full utilization. When they diverge—different threads take different branches—the warp must serialize both paths. That's textbook GPU architecture. What's not textbook is how this interacts with paged KV attention in production decode loops. In a paged KV system (the exact kind vLLM introduced), KV blocks are scattered across VRAM. Different sequences in the same microbatch may have their KV blocks in different residency states: some hot in L2, some cold in HBM, some partially evicted under paging pressure. When the attention kernel issues loads for KV blocks, threads within the same warp can stall at different rates depending on which blocks they're accessing and where those blocks reside. 
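To make that coupling concrete, here is a toy model of the effect, nothing more sophisticated than 'a warp retires when its slowest lane returns'. The latency numbers are invented for illustration and do not describe any specific GPU or cache hierarchy.

import random

L2_HIT_NS, HBM_MISS_NS = 40, 400   # invented illustrative latencies

def warp_retire_ns(l2_hit_prob, lanes=32):
    # lockstep execution: the warp waits for its slowest lane
    lane_latencies = [L2_HIT_NS if random.random() < l2_hit_prob else HBM_MISS_NS
                      for _ in range(lanes)]
    return max(lane_latencies)

random.seed(0)
for hit_prob in (1.0, 0.97, 0.80):
    mean = sum(warp_retire_ns(hit_prob) for _ in range(10_000)) / 10_000
    print(f"L2 hit probability {hit_prob:.2f} -> mean warp retire time ~{mean:.0f} ns")

Even a 3 percent per-lane miss probability means most 32-lane warps contain at least one cold lane, so the mean retire time jumps toward the miss latency rather than the hit latency.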
This creates a subtle but measurable pathology: Lane divergence inside the attention kernel. Not control-flow divergence in the traditional sense, but memory-latency divergence: some lanes return fast (L2 hit), some stall (HBM fetch), and the warp can't retire until the slowest lane completes. Register pressure amplification. When warps stall, the SM must keep their register state live. Under heavy stalling, register pressure rises, which can force the compiler to spill to local memory (which lives in L2/HBM). Spills create more memory traffic, which creates more stalls. It's a feedback loop at the microarchitectural level. Measurable p99 step variance in identical-shape batches. This is the part that confuses teams. Two consecutive decode steps with the same batch size and the same sequence lengths can have different step times, because the KV block residency pattern differed. The shape was identical. The memory topology was not. If you want to see this in practice, the tool is Nsight Systems. What you're looking for: # Nsight Systems trace analysis: partition warp stall cycles # Look for these stall reasons in the GPU metrics view: # - smsp__warps_issue_stalled_long_scoreboard → memory dependency stalls # - smsp__warps_issue_stalled_short_scoreboard → register dependency stalls # - smsp__warps_issue_stalled_no_instruction → instruction cache miss # # Correlate with: # - l1tex__t_sectors_pipe_lsu_mem_global_op_ld → global load sectors (KV fetches) # - lts__t_sectors_srcunit_tex_op_read_hit_rate → L2 hit rate during attention # # The diagnostic signal: when stall_long_scoreboard spikes correlate with # L2 hit rate drops, you're seeing KV residency divergence across warps. The stall partition tells you why the warp stalled. When you see long_scoreboard stalls dominating during attention kernels—and you see them correlating with L2 miss rate fluctuations—you're observing exactly the KV residency divergence I'm describing. The warp is waiting for scattered KV blocks, and the scatter pattern changes with every batch because paging decisions are state-dependent. This is how "identical shapes" produce different timelines. The shape is the same. The KV block map is not. And the block map is a function of runtime allocation history—the same state-dependent variable that drives everything else in this article. Tensor core fragment utilization collapse under shape churn Now let's go inside the tensor cores themselves. H100 and Blackwell tensor cores operate on matrix fragments—fixed-size tiles that map directly to the hardware's matrix multiply-accumulate units. On H100, the native fragment sizes for FP16 are typically 16×16×16 (m×n×k). When your operand dimensions align cleanly with fragment boundaries, you get full utilization. When they don't, you get fragment waste: the hardware still executes full fragments, but some of the lanes carry padding zeros. In continuous batching, shape churn is the norm. Your microbatch dimensions change at token cadence. And this is where a subtle but devastating efficiency collapse hides. 
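As a back-of-envelope way to quantify the waste, look at just the trailing fragment along one dimension. The sketch below uses the 16-wide fragment size from the example that follows; it illustrates the padding effect only and is not a model of how any specific kernel tiles its operands.

def trailing_tile_utilization(dim, frag=16):
    # fraction of the last fragment along this dimension that carries real data
    rem = dim % frag
    return 1.0 if rem == 0 else rem / frag

for batch in (16, 17):
    util = trailing_tile_utilization(batch)
    print(f"batch dim {batch}: trailing-tile utilization {util:.0%}")
# batch dim 16: 100% (aligned, no padding rows)
# batch dim 17: 6% (one real row in a 16-row fragment, 15 rows of padding)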
Consider two microbatches that arrive one step apart: # Step t: B=16, L=2048 → GEMM shape aligns cleanly with 16×16 fragments # Fragment utilization: ~98% # cuBLASLt selects: WMMA-based kernel (tensor core native) # # Step t+1: B=17, L=2047 → GEMM shape straddles fragment boundaries # Fragment utilization: drops below 25% on trailing tiles # cuBLASLt selects: fallback to non-WMMA FP16 kernel # (or WMMA with heavy padding, depending on heuristic) The difference is one sequence in the batch and one token in context length. The performance consequence is that the runtime switches from tensor core native execution to a scalar FP16 path. That's not a minor variant. That's a fundamentally different instruction mix, a different reduction tree, and a different accumulation order. The ulp deltas that result from this switch don't stay contained in the GEMM output. They propagate forward through layer normalization—which is itself a reduction over the hidden dimension. Layer norm amplifies small differences because it divides by a variance term computed from the same values. A tiny shift in the GEMM output becomes a slightly different variance, which becomes a slightly different normalization, which becomes a slightly different input to the next layer's attention. You can observe this directly via cuBLASLt's algorithm preference reporting: # cuBLASLt algorithm preference histogram (conceptual) # Track per-step which algorithm ID was selected for the primary GEMM # # Healthy (stable shapes): # algo_id=42 (WMMA_TENSOR_OP_HMMA_16816) → 99.2% of steps # algo_id=17 (SIMT_FP16_SPLITK) → 0.8% of steps # # Under shape churn (continuous batching, mixed lengths): # algo_id=42 (WMMA_TENSOR_OP_HMMA_16816) → 61.3% of steps # algo_id=17 (SIMT_FP16_SPLITK) → 22.1% of steps # algo_id=31 (WMMA_TENSOR_OP_PAD16) → 16.6% of steps # # When algo_id distribution churns, your reduction tree is churning. # When your reduction tree churns, your last bits are churning. # When your last bits churn under thin margins, your tokens can flip. That histogram is the smoking gun. When you see algorithm preference distribution widening under load, you're watching the tensor cores get destabilized by shape churn. The fix isn't "use bigger batches." The fix is to understand that continuous batching creates a shape distribution, not a fixed shape, and that shape distribution maps directly to a tactic distribution, which maps directly to a ulp distribution. NCCL causal backpressure chains across TP+DP pods Now scale this to the fabric. Take an 8×TP + 4×DP pod: 32 GPUs total, where every token step requires AllReduce across the 8-way TP group, and gradient synchronization (or KV redistribution in some architectures) across the 4-way DP group. Here's the causal backpressure chain I've traced in production, laid out as a timeline: Rank 5 (of 8 TP ranks) hits a quant/dequant stall. Its KV blocks are fragmented, workspace is tight, and the runtime forces a dequant cycle mid-attention. That adds ~1.2ms to this rank's compute. AllReduce stalls on Rank 5. The other 7 ranks complete their portion and issue their NCCL send. Rank 5 hasn't arrived yet. NCCL's ring/tree protocol can't progress past this rank. Effective t_sync inflates by 2× compared to the no-straggler baseline. P2P retransmit triggers. Under some fabric topologies and congestion states, the delayed arrival from Rank 5 can cause NCCL to hit internal retry logic on the NVLink or InfiniBand path. 
This is not a "network error"—it's the transport protocol managing flow control under backpressure. But it adds latency jitter that is invisible unless you're tracing at the NCCL bootstrap level. vLLM scheduler reacts to the stretched step. The scheduler sees that step t took 2× longer than expected. Under its latency-aware admission control, it drops batch size from 32 → 12 to protect SLO. Smaller batch means different shapes. Different shapes mean different tactics. The plan changes. The batch size drop propagates. With batch size at 12, queued requests wait longer. Queue pressure builds. When the scheduler recovers and re-admits, the burst creates shape churn. Shape churn destabilizes tensor core fragment utilization. The system is now in a different execution regime—triggered by one rank's memory fragmentation. That is a causal backpressure chain. Not a latency spike. Not a network blip. A causally connected sequence where a microarchitectural event on one device reshapes the execution plan across the entire pod. To trace this, you need NCCL bootstrap traces with NVTX domain annotations: # NCCL tracing with NVTX domains for causal analysis # # Environment setup for trace collection: # NCCL_DEBUG=INFO # NCCL_DEBUG_SUBSYS=INIT,COLL,P2P # NSYS_NVTX_DOMAINS=nccl,cuda,cublas # # In Nsight Systems, correlate: # 1. Per-rank kernel duration (cuda domain) — identify the straggler # 2. NCCL collective start/end (nccl domain) — measure t_sync inflation # 3. P2P transport events (nccl/P2P) — detect retransmit/backpressure # 4. Scheduler batch decisions (application NVTX) — see batch size reaction # # The causal signal: when rank N's kernel duration spike aligns with # NCCL collective inflation across all ranks, followed by batch size # reduction in the scheduler, you have a causal backpressure chain. # # Regex for filtering straggler events in nsys export: # grep -E "ncclAllReduce.*duration_us > (2 * median_duration)" trace.sqlite # → correlate timestamp with scheduler batch_size change events This is the telemetry that separates "we think there was network jitter" from "Rank 5's dequant stall caused a 2× collective inflation that forced the scheduler to halve batch size, which shifted the shape class into a non-WMMA tactic for the next 47 steps." The first is a guess. The second is a causal explanation. And in an incident review at scale, only the second one survives. ISR + checkpoint overlap pathology: memory evaporation under replication pressure This is the deepest pathology in this article, and it almost never surfaces below 512 sequences per second. Large-scale inference deployments use incremental state replication (ISR) for fault tolerance: rather than checkpointing the entire model state, you replicate KV cache deltas and scheduler state to a standby node incrementally, so failover is fast. Separately, many systems run async checkpointing for recovery: periodic snapshots of model and optimizer state written to persistent storage, overlapped with inference to avoid blocking the decode loop. Under normal load, these two systems coexist peacefully. ISR replicates small deltas. Checkpointing writes in the background. Memory headroom is sufficient for both. Under paging pressure—the exact regime we've been discussing throughout this article—they collide. Here's the pathological interaction: The system is under VRAM pressure. KV blocks are being paged (allocated, evicted, re-allocated) at high frequency. Memory headroom is thin. ISR kicks in. 
It needs to replicate recent KV deltas to the standby. To do this, it must pin certain KV blocks in memory while it serializes and transmits them. Async checkpointing overlaps. The checkpoint writer is also holding references to memory regions it's snapshotting. Under normal conditions, this is fine—there's enough headroom. Under paging pressure, the checkpoint's memory holds compete with ISR's memory holds. Memory evaporation. The combined pinning from ISR + checkpointing temporarily removes KV blocks from the pool available to the decode loop. The pager sees available blocks drop. It may be forced to evict active KV blocks—blocks that are needed for in-flight sequences—to make room. Evicted blocks must be recomputed. When a sequence's KV is evicted mid-collective (during an AllReduce, for example), the rank that lost its KV must recompute it. That recompute makes this rank the straggler. And we already know what stragglers do to the collective timeline. The straggler triggers the full backpressure chain. Collective stall → batch size reduction → shape churn → tactic churn → output drift. All caused by a fault-tolerance mechanism designed to keep you safe.

ISR pins KV deltas for replication while async checkpointing pins regions for snapshotting. Under paging pressure, the combined pinning shrinks the decode-available KV pool, forces evictions and recompute, creates stragglers, and cascades into collective stalls → batch reduction → shape/tactic churn → p99 output drift.

I call this memory evaporation because from the decode loop's perspective, VRAM that was available simply vanishes for a window of time. The blocks are still physically present—they're held by ISR and the checkpointer, but they're not available to the runtime. The effect is identical to a sudden drop in free VRAM, and the runtime reacts accordingly: it enters a pressured regime.

This is why the pathology rarely surfaces below 512 seq/s. At lower throughput, there's enough headroom that ISR and checkpointing never compete meaningfully with the decode loop's memory needs. At high throughput under sustained load, the margins collapse, and the three systems—decode, ISR, checkpoint—start fighting over the same memory.

The fix is not "turn off ISR." The fix is to coordinate memory budgets across these three subsystems and to treat ISR and checkpointing as memory consumers that participate in the regime calculation. If your regime function doesn't account for replication and checkpoint holds, it's underestimating pressure, and your system will surprise you at exactly the scale where fault tolerance matters most.

# extended regime function accounting for replication and checkpoint pressure
def regime_extended(vram_free_mb, paging_on, isolation_strict, queue_p95_ms,
                    isr_pinned_mb, ckpt_pinned_mb, kv_pool_total_mb):
    effective_free = vram_free_mb - isr_pinned_mb - ckpt_pinned_mb
    effective_ratio = effective_free / kv_pool_total_mb if kv_pool_total_mb > 0 else 1.0
    if isolation_strict:
        return "isolation_strict"
    if effective_ratio < 0.05:
        return "memory_evaporation"  # ISR+ckpt collision
    if paging_on:
        return "paging"
    if effective_free < 1024:
        return "memory_pressured"
    if queue_p95_ms > 50:
        return "queue_degraded"
    return "normal"

That "memory_evaporation" regime is the one you never see at idle. It only appears when throughput is high enough that ISR frequency, checkpoint frequency, and decode memory demand all peak simultaneously. And when it appears, it doesn't show up as an OOM.
It shows up as a straggler, which shows up as a collective stall, which shows up as a batch size drop, which shows up as a shape change, which shows up as output drift at p99. That's the full causal chain from fault tolerance to token flip. The chip-architect handoff These four pathologies, wavefront divergence, tensor core fragmentation, NCCL backpressure, and ISR collision are what elevate from principal-level operational insight to chip-architect-level systems thinking. They share a common structure: A microarchitectural or infrastructure event occurs that is invisible at the API layer. The event changes the timeline or the memory topology, not the "inputs." The changed timeline or topology feeds back into scheduling, shaping, or tactic selection. The feedback loop produces a different executed plan. The different plan produces a different result that is correct by contract but different by observation. If you're instrumenting at this depth, you're not debugging anymore. You're operating a system where the observability itself is part of the architecture. And if you're carrying the thesis of this article to its logical conclusion: the executed plan is not just a function of the GPU state. It's a function of the warp state, the fragment state, the fabric state, and the replication state—all coupled through continuous batching at token cadence. Security is not a layer, it changes execution Now let’s go deep, because this is where a lot of principal level reviews go wrong. Teams talk about security as confidentiality and correctness as something separate. In multi tenant inference, they couple. IOMMU based GPU isolation and DMA remapping Microsoft documents IOMMU based GPU isolation as a technique to manage how GPUs access system memory, improving security and stability: Microsoft also documents IOMMU DMA remapping, describing how GPUs access memory through logical addresses that are no longer mapped one to one, enabling logically contiguous address ranges through translation: This matters for two reasons. First, it is a real hardware enforced boundary, not a policy checkbox. Second, boundaries introduce overhead and constraints. Constraints change what is allowed. Allowed execution choices shape the plan space. Confidential computing on H100 NVIDIA states that H100 is the first GPU to introduce support for confidential computing and that it can be used in virtualized environments with VMs or Kubernetes based deployments. Azure has also published general availability of confidential VMs with H100, which is the practical deployment side of this posture: Now the key architectural point. When you turn on stronger isolation, you often restrict sharing. You restrict cross tenant microbatching. You add attestation requirements. You change how memory is mapped and protected. That can reduce throughput. Reduced throughput moves you closer to regime boundaries. When the system crosses a regime boundary, the executed plan changes. Security posture becomes an SLO dimension. If you do not test it, you do not know what system you are running. GPU cache side channels, why sharing is not a theoretical risk There is published research that treats GPU caches as a leakage surface. The USENIX Security 2024 paper Invalidate plus Compare presents a timer free GPU cache attack primitive. I will not provide attack recipes. You do not need them to understand the conclusion. If your threat model includes untrusted co tenants, shared microarchitectural resources matter. 
If you respond by increasing isolation, your execution constraints change. That changes performance and can change the execution regimes your serving stack enters. Security and runtime behavior are coupled. State collapse, the phase transition that looks like model instability If you don’t know what state collapse is, imagine a highway that looks perfectly calm at 2 a.m. Every lane is open. Every car keeps its distance. Your ETA is stable. You run the same route ten times and you get the same arrival time. Then 8:30 a.m. hits. Nothing “broke” in the highway. The asphalt is the same. The speed limit is the same. The cars are the same. But the system crosses a density threshold. One small brake tap becomes a shockwave. Lanes start interacting. Merges become bottlenecks. A single slow truck creates a queue that ripples backwards. Suddenly your ETA isn’t a property of your car anymore. It’s a property of the traffic regime. That is state collapse in production inference. At low load, the system behaves stable. At high load, output drift appears. And teams mislabel it as “model instability,” or “LLM randomness,” or “temperature drift.” Most of the time, it is none of that. It is a phase transition in the runtime. You didn’t change weights. You crossed a regime boundary. What collapses, exactly State collapse is not “everything gets slower.” It is when the control plane loses the degrees of freedom it was using to keep execution consistent. Under low load, the runtime has slack: enough VRAM headroom to keep preferred tactics feasible enough cache residency to keep step times predictable enough scheduling flexibility to keep microbatch composition stable enough workspace contiguity to avoid algorithm fallbacks enough fabric stability (in multi-node TP) to keep step cadence tight Under high load, that slack disappears. The runtime stops being a “fast executor” and becomes a “survival scheduler.” And once it crosses that boundary, it starts making different decisions that are all valid, all correct by contract, and all capable of shifting outputs. This is why it feels like instability: the model hasn’t changed, but the executed plan has. Why this shows up as output drift, not just latency drift Because decoding is a branching process. A small numerical difference that does nothing in a benchmark can flip a token if the margin is thin. One flip changes the context. The context changes the next logits. Now you’re on a different path. So the runtime doesn’t need to be “wrong” to produce different text. It just needs to execute a different legal plan under a different legal regime. That is the whole thesis of this article, condensed into one sentence: Weights are static. Behavior is a property of the executed plan. The executed plan is a function of state. The common triggers that push systems into collapse You can treat these as the usual “threshold crossings” that shrink the feasible plan space: Memory headroom shrinks → feasible tactic set shrinks Preferred kernels often require workspace. When headroom or contiguity drops, tactics become illegal and the engine selects other tactics. Cache residency collapses → stalls rise → step timing drifts L2 hit rate drops, HBM traffic rises, and decode steps stretch. In continuous batching, stretched steps reshape the next batch. Continuous batching shifts the mix and shapes Under load, microbatch membership changes at token cadence. Shape class changes are not cosmetic; they change kernel feasibility. 
Framework and engine algorithm selection changes depending on settings Autotuning, benchmarking, and backend heuristics mean the “same op” can legally choose different algorithms. Under pressure, the best choice can become infeasible. CUDA execution permits ordering freedom and floating point order sensitivity remains true Parallel staging and legal reordering can shift last bits. Under thin margins, last bits can become different tokens. Nothing here requires a bug. This is what “execution under constraint” looks like. The incident question that stops the hand-waving If you want a more honest incident question, use this: Which execution regime ran, and what constraints pushed us into it? Not “was the prompt the same.” Not “were the weights the same.” Not “did we set temperature to zero.” Regime first. Because state collapse is not a mystery. It’s a threshold. And once you learn to name the threshold, you can instrument it, test it, and stop being surprised by it at p95 and p99. A reproducibility protocol that works for principals, not demos Logging prompts is not reproducibility. It is wishful thinking. If you want to be able to defend behavior, you need to reconstruct the execution state. Log the execution contract Per request, log: effective input length after shaping truncation boundary and reason decode configuration actually applied admission time, queue time, GPU time per step batch fingerprint or at minimum batch identity and shape class memory headroom watermark and whether you were in a pressured allocation regime engine precision mode settings and any fallback relevant flags cuDNN benchmark and deterministic settings if relevant isolation posture, including whether cross tenant batching is permitted Track margins early Track top two logit margins for early steps. Use it as a stability budget. If the margin collapses under a certain prompt family, treat that as a risk surface. Not every prompt is equally stable. Test under regimes, not at idle Do not run determinism tests at idle and call it solved. Test under: sustained concurrency mixed sequence lengths continuous batching realistic memory pressure real isolation posture If you do not do this, you are validating a different system than the one you ship. vLLM’s paper exists precisely because these conditions define the serving problem. Closing If you want production LLM behavior to be explainable, stop treating the model as the whole system. Weights are static. Executed math is selected under constraint. Behavior lives in the gap. You did not deploy weights. You deployed a physics constrained runtime that contains weights. And that runtime is allowed to change the executed plan, because floating point order matters, CUDA scheduling freedom is part of the contract, engines can choose precision pathways, and serving stacks intentionally reshape batching and memory. Acknowledgments While this article dives into the hidden memory mechanics that shape LLM behavior under load, I’m grateful it was peer-reviewed and challenged before publishing. A special thanks for Hammad Atta and Abhilekh Verma for peer-reviewing this piece and challenging it from a security-and-systems angle. If this article resonated, it’s likely because it describes a reality many teams encounter only after an incident: production LLM behavior is a property of the executed plan, and the executed plan is a function of state. 
If you’re running production inference at scale and observing behavior shifts under load, especially in tail-latency regimes, I’m happy to connect on LinkedIn. I’m open to substantive technical discussion. Thank you for reading. I hope this helps you surface the hidden variables in serving and turn them into telemetry, controls, and repeatable postmortem evidence. And if you’re seeing similar regime transitions or plan churn in your own deployments, I’d be interested to hear how it presents in your stack.
— Hazem Ali
Microsoft AI MVP, Distinguished AI & ML Engineer / Architect
We've just wrapped up our live series on Python + AI, a comprehensive nine-part journey diving deep into how to use generative AI models from Python. The series introduced multiple types of models, including LLMs, embedding models, and vision models. We dug into popular techniques like RAG, tool calling, and structured outputs. We assessed AI quality and safety using automated evaluations and red-teaming. Finally, we developed AI agents using popular Python agents frameworks and explored the new Model Context Protocol (MCP). To help you apply what you've learned, all of our code examples work with GitHub Models, a service that provides free models to every GitHub account holder for experimentation and education. Even if you missed the live series, you can still access all the material using the links below! If you're an instructor, feel free to use the slides and code examples in your own classes. If you're a Spanish speaker, check out the Spanish version of the series. Python + AI: Large Language Models 📺 Watch recording In this session, we explore Large Language Models (LLMs), the models that power ChatGPT and GitHub Copilot. We use Python to interact with LLMs using popular packages like the OpenAI SDK and LangChain. We experiment with prompt engineering and few-shot examples to improve outputs. We also demonstrate how to build a full-stack app powered by LLMs and explain the importance of concurrency and streaming for user-facing AI apps. Slides for this session Code repository with examples: python-openai-demos Python + AI: Vector embeddings 📺 Watch recording In our second session, we dive into a different type of model: the vector embedding model. A vector embedding is a way to encode text or images as an array of floating-point numbers. Vector embeddings enable similarity search across many types of content. In this session, we explore different vector embedding models, such as the OpenAI text-embedding-3 series, through both visualizations and Python code. We compare distance metrics, use quantization to reduce vector size, and experiment with multimodal embedding models. Slides for this session Code repository with examples: vector-embedding-demos Python + AI: Retrieval Augmented Generation 📺 Watch recording In our third session, we explore one of the most popular techniques used with LLMs: Retrieval Augmented Generation. RAG is an approach that provides context to the LLM, enabling it to deliver well-grounded answers for a particular domain. The RAG approach works with many types of data sources, including CSVs, webpages, documents, and databases. In this session, we walk through RAG flows in Python, starting with a simple flow and culminating in a full-stack RAG application based on Azure AI Search. Slides for this session Code repository with examples: python-openai-demos Python + AI: Vision models 📺 Watch recording Our fourth session is all about vision models! Vision models are LLMs that can accept both text and images, such as GPT-4o and GPT-4o mini. You can use these models for image captioning, data extraction, question answering, classification, and more! We use Python to send images to vision models, build a basic chat-with-images app, and create a multimodal search engine. Slides for this session Code repository with examples: openai-chat-vision-quickstart Python + AI: Structured outputs 📺 Watch recording In our fifth session, we discover how to get LLMs to output structured responses that adhere to a schema. 
In Python, all you need to do is define a Pydantic BaseModel to get validated output that perfectly meets your needs. We focus on the structured outputs mode available in OpenAI models, but you can use similar techniques with other model providers. Our examples demonstrate the many ways you can use structured responses, such as entity extraction, classification, and agentic workflows. Slides for this session Code repository with examples: python-openai-demos Python + AI: Quality and safety 📺 Watch recording This session covers a crucial topic: how to use AI safely and how to evaluate the quality of AI outputs. There are multiple mitigation layers when working with LLMs: the model itself, a safety system on top, the prompting and context, and the application user experience. We focus on Azure tools that make it easier to deploy safe AI systems into production. We demonstrate how to configure the Azure AI Content Safety system when working with Azure AI models and how to handle errors in Python code. Then we use the Azure AI Evaluation SDK to evaluate the safety and quality of output from your LLM. Slides for this session Code repository with examples: ai-quality-safety-demos Python + AI: Tool calling 📺 Watch recording In the final part of the series, we focus on the technologies needed to build AI agents, starting with the foundation: tool calling (also known as function calling). We define tool call specifications using both JSON schema and Python function definitions, then send these definitions to the LLM. We demonstrate how to properly handle tool call responses from LLMs, enable parallel tool calling, and iterate over multiple tool calls. Understanding tool calling is absolutely essential before diving into agents, so don't skip over this foundational session. Slides for this session Code repository with examples: python-openai-demos Python + AI: Agents 📺 Watch recording In the penultimate session, we build AI agents! We use Python AI agent frameworks such as the new agent-framework from Microsoft and the popular LangGraph framework. Our agents start simple and then increase in complexity, demonstrating different architectures such as multiple tools, supervisor patterns, graphs, and human-in-the-loop workflows. Slides for this session Code repository with examples: python-ai-agent-frameworks-demos Python + AI: Model Context Protocol 📺 Watch recording In the final session, we dive into the hottest technology of 2025: MCP (Model Context Protocol). This open protocol makes it easy to extend AI agents and chatbots with custom functionality, making them more powerful and flexible. We demonstrate how to use the Python FastMCP SDK to build an MCP server running locally and consume that server from chatbots like GitHub Copilot. Then we build our own MCP client to consume the server. Finally, we discover how easy it is to connect AI agent frameworks like LangGraph and Microsoft agent-framework to MCP servers. With great power comes great responsibility, so we briefly discuss the security risks that come with MCP, both as a user and as a developer. Slides for this session Code repository with examples: python-mcp-demo
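If you want a taste of what the MCP session builds, here is a minimal local server sketch. The API shown follows the FastMCP quickstart style from the MCP Python tooling; adjust the import to match the package you install, and note that the study_tip tool is a made-up example.

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("demo-server")

@mcp.tool()
def study_tip(topic: str) -> str:
    """Return a short study tip for the given topic."""
    return f"Break '{topic}' into three sub-questions and explain each one aloud."

if __name__ == "__main__":
    mcp.run()   # defaults to the stdio transport, so local MCP clients can connect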
Introduction AI-powered coding assistants have transformed how developers write and review code. But most of these tools require sending your source code to cloud services, a non-starter for teams working with proprietary codebases, air-gapped environments, or strict compliance requirements. What if you could have an intelligent coding agent that finds bugs, fixes them, runs your tests, and produces PR-ready summaries, all without a single byte leaving your machine? The Local Repo Patch Agent demonstrates exactly this. By combining the GitHub Copilot SDK for agent orchestration with Foundry Local for on-device inference, this project creates a fully autonomous coding workflow that operates entirely on your hardware. The agent scans your repository, identifies bugs and code smells, applies fixes, verifies them through your test suite, and generates a comprehensive summary of all changes, completely offline and secure. This article explores the architecture behind this integration, walks through the key implementation patterns, and shows you how to run the agent yourself. Whether you're building internal developer tools, exploring agentic workflows, or simply curious about what's possible when you combine GitHub's SDK with local AI, this project provides a production-ready foundation to build upon. Why Local AI Matters for Code Analysis Cloud-based AI coding tools have proven their value—GitHub Copilot has fundamentally changed how millions of developers work. But certain scenarios demand local-first approaches where code never leaves the organisation's network. Consider these real-world constraints that teams face daily: Regulatory compliance: Financial services, healthcare, and government projects often prohibit sending source code to external services, even for analysis Intellectual property protection: Proprietary algorithms and trade secrets can't risk exposure through cloud API calls Air-gapped environments: Secure facilities and classified projects have no internet connectivity whatsoever Latency requirements: Real-time code analysis in IDEs benefits from zero network roundtrip Cost control: High-volume code analysis without per-token API charges The Local Repo Patch Agent addresses all these scenarios. By running the AI model on-device through Foundry Local and using the GitHub Copilot SDK for orchestration, you get the intelligence of agentic coding workflows with complete data sovereignty. The architecture proves that "local-first" doesn't mean "capability-limited." The Technology Stack Two core technologies make this architecture possible, working together through a clever integration called BYOK (Bring Your Own Key). Understanding how they complement each other reveals the elegance of the design. GitHub Copilot SDK The GitHub Copilot SDK provides the agent runtime, the scaffolding that handles planning, tool invocation, streaming responses, and the orchestration loop that makes agentic behaviour possible. Rather than managing raw LLM API calls, developers define tools (functions the agent can call) and system prompts, and the SDK handles everything else. 
Key capabilities the SDK brings to this project: Session management: Maintains conversation context across multiple agent interactions Tool orchestration: Automatically invokes defined tools when the model requests them Streaming support: Real-time response streaming for responsive user interfaces Provider abstraction: Works with any OpenAI-compatible API through the BYOK configuration Foundry Local Foundry Local brings Azure AI Foundry's model catalog to your local machine. It automatically selects the best available hardware acceleration—GPU, NPU, or CP, and exposes models through an OpenAI-compatible API on localhost. Models run entirely on-device with no telemetry or data transmission. For this project, Foundry Local provides: On-device inference: All AI processing happens locally, ensuring complete data privacy Dynamic port allocation: The SDK auto-detects the Foundry Local endpoint, eliminating configuration hassle Model flexibility: Swap between models like qwen2.5-coder-1.5b , phi-3-mini , or larger variants based on your hardware OpenAI API compatibility: Standard API format means the GitHub Copilot SDK works without modification The BYOK Integration The entire connection between the GitHub Copilot SDK and Foundry Local happens through a single configuration object. This BYOK (Bring Your Own Key) pattern tells the SDK to route all inference requests to your local model instead of cloud services: const session = await client.createSession({ model: modelId, provider: { type: "openai", // Foundry Local speaks OpenAI's API format baseUrl: proxyBaseUrl, // Streaming proxy → Foundry Local apiKey: manager.apiKey, wireApi: "completions", // Chat Completions API }, streaming: true, tools: [ /* your defined tools */ ], }); This configuration is the key insight: with one config object, you've redirected an entire agent framework to run on local hardware. No code changes to the SDK, no special adapters—just standard OpenAI-compatible API communication. Architecture Overview The Local Repo Patch Agent implements a layered architecture where each component has a clear responsibility. Understanding this flow helps when extending or debugging the system. ┌─────────────────────────────────────────────────────────┐ │ Your Terminal / Web UI │ │ npm run demo / npm run ui │ └──────────────┬──────────────────────────────────────────┘ │ ┌──────────────▼──────────────────────────────────────────┐ │ src/agent.ts (this project) │ │ │ │ ┌────────────────────────────┐ ┌──────────────────┐ │ │ │ GitHub Copilot SDK │ │ Agent Tools │ │ │ │ (CopilotClient) │ │ list_files │ │ │ │ BYOK → Foundry │ │ read_file │ │ │ └────────┬───────────────────┘ │ write_file │ │ │ │ │ run_command │ │ └────────────┼───────────────────────┴──────────────────┘ │ │ │ │ JSON-RPC │ ┌────────────▼─────────────────────────────────────────────┐ │ GitHub Copilot CLI (server mode) │ │ Agent orchestration layer │ └────────────┬─────────────────────────────────────────────┘ │ POST /v1/chat/completions (BYOK) ┌────────────▼─────────────────────────────────────────────┐ │ Foundry Local (on-device inference) │ │ Model: qwen2.5-coder-1.5b via ONNX Runtime │ │ Endpoint: auto-detected (dynamic port) │ └───────────────────────────────────────────────────────────┘ The data flow works as follows: your terminal or web browser sends a request to the agent application. The agent uses the GitHub Copilot SDK to manage the conversation, which communicates with the Copilot CLI running in server mode. 
The CLI, configured with BYOK, sends inference requests to Foundry Local running on localhost. Responses flow back up the same path, with tool invocations happening in the agent.ts layer. The Four-Phase Workflow The agent operates through a structured four-phase loop, each phase building on the previous one's output. This decomposition transforms what would be an overwhelming single prompt into manageable, verifiable steps. Phase 1: PLAN The planning phase scans the repository and produces a numbered fix plan. The agent reads every source and test file, identifies potential issues, and outputs specific tasks to address: // Phase 1 system prompt excerpt const planPrompt = ` You are a code analysis agent. Scan the repository and identify: 1. Bugs that cause test failures 2. Code smells and duplication 3. Style inconsistencies Output a numbered list of fixes, ordered by priority. Each item should specify: file path, line numbers, issue type, and proposed fix. `; The tools available during this phase are list_files and read_file —the agent explores the codebase without modifying anything. This read-only constraint prevents accidental changes before the plan is established. Phase 2: EDIT With a plan in hand, the edit phase applies each fix by rewriting affected files. The agent receives the plan from Phase 1 and systematically addresses each item: // Phase 2 adds the write_file tool const editTools = [ { name: "write_file", description: "Write content to a file, creating or overwriting it", parameters: { type: "object", properties: { path: { type: "string", description: "File path relative to repo root" }, content: { type: "string", description: "Complete file contents" } }, required: ["path", "content"] } } ]; The write_file tool is sandboxed to the demo-repo directory, path traversal attempts are blocked, preventing the agent from modifying files outside the designated workspace. Phase 3: VERIFY After making changes, the verification phase runs the project's test suite to confirm fixes work correctly. If tests fail, the agent attempts to diagnose and repair the issue: // Phase 3 adds run_command with an allowlist const allowedCommands = ["npm test", "npm run lint", "npm run build"]; const runCommandTool = { name: "run_command", description: "Execute a shell command (npm test, npm run lint, npm run build only)", execute: async (command: string) => { if (!allowedCommands.includes(command)) { throw new Error(`Command not allowed: ${command}`); } // Execute and return stdout/stderr } }; The command allowlist is a critical security measure. The agent can only run explicitly permitted commands—no arbitrary shell execution, no data exfiltration, no system modification. Phase 4: SUMMARY The final phase produces a PR-style Markdown report documenting all changes. 
This summary includes what was changed, why each change was necessary, test results, and recommended follow-up actions: ## Summary of Changes ### Bug Fix: calculateInterest() in account.js - **Issue**: Division instead of multiplication caused incorrect interest calculations - **Fix**: Changed `principal / annualRate` to `principal * (annualRate / 100)` - **Tests**: 3 previously failing tests now pass ### Refactor: Duplicate formatCurrency() removed - **Issue**: Identical function existed in account.js and transaction.js - **Fix**: Both files now import from utils.js - **Impact**: Reduced code duplication, single source of truth ### Test Results - **Before**: 6/9 passing - **After**: 9/9 passing This structured output makes code review straightforward, reviewers can quickly understand what changed and why without digging through diffs. The Demo Repository: Intentional Bugs The project includes a demo-repo directory containing a small banking utility library with intentional problems for the agent to find and fix. This provides a controlled environment to demonstrate the agent's capabilities. Bug 1: Calculation Error in calculateInterest() The account.js file contains a calculation bug that causes test failures: // BUG: should be principal * (annualRate / 100) function calculateInterest(principal, annualRate) { return principal / annualRate; // Division instead of multiplication! } This bug causes 3 of 9 tests to fail. The agent identifies it during the PLAN phase by correlating test failures with the implementation, then fixes it during EDIT. Bug 2: Code Duplication The formatCurrency() function is copy-pasted in both account.js and transaction.js, even though a canonical version exists in utils.js. This duplication creates maintenance burden and potential inconsistency: // In account.js (duplicated) function formatCurrency(amount) { return '$' + amount.toFixed(2); } // In transaction.js (also duplicated) function formatCurrency(amount) { return '$' + amount.toFixed(2); } // In utils.js (canonical, but unused) export function formatCurrency(amount) { return '$' + amount.toFixed(2); } The agent identifies this duplication during planning and refactors both files to import from utils.js, eliminating redundancy. Handling Foundry Local Streaming Quirks One technical challenge the project solves is Foundry Local's behaviour with streaming requests. As of version 0.5, Foundry Local can hang on stream: true requests. The project includes a streaming proxy that works around this limitation transparently. The Streaming Proxy The streaming-proxy.ts file implements a lightweight HTTP proxy that converts streaming requests to non-streaming, then re-encodes the single response as SSE (Server-Sent Events) chunks—the format the OpenAI SDK expects: // streaming-proxy.ts simplified logic async function handleRequest(req: Request): Promise { const body = await req.json(); // If it's a streaming chat completion, convert to non-streaming if (body.stream === true && req.url.includes('/chat/completions')) { body.stream = false; const response = await fetch(foundryEndpoint, { method: 'POST', body: JSON.stringify(body), headers: { 'Content-Type': 'application/json' } }); const data = await response.json(); // Re-encode as SSE stream for the SDK return createSSEResponse(data); } // Non-streaming and non-chat requests pass through unchanged return fetch(foundryEndpoint, req); } This proxy runs on port 8765 by default and sits between the GitHub Copilot SDK and Foundry Local. 
The SDK thinks it's talking to a streaming-capable endpoint, while the actual inference happens non-streaming. The conversion is transparent, no changes needed to SDK configuration. Text-Based Tool Call Detection Small on-device models like qwen2.5-coder-1.5b sometimes output tool calls as JSON text rather than using OpenAI-style function calling. The SDK won't fire tool.execution_start events for these text-based calls, so the agent includes a regex-based detector: // Pattern to detect tool calls in model output const toolCallPattern = /\{[\s\S]*"name":\s*"(list_files|read_file|write_file|run_command)"[\s\S]*\}/; function detectToolCall(text: string): ToolCall | null { const match = text.match(toolCallPattern); if (match) { try { return JSON.parse(match[0]); } catch { return null; } } return null; } This fallback ensures tool calls are captured regardless of whether the model uses native function calling or text output, keeping the dashboard's tool call counter and CLI log accurate. Security Considerations Running an AI agent that can read and write files and execute commands requires careful security design. The Local Repo Patch Agent implements multiple layers of protection: 100% local execution: No code, prompts, or responses leave your machine—complete data sovereignty Command allowlist: The agent can only run npm test , npm run lint , and npm run build —no arbitrary shell commands Path sandboxing: File tools are locked to the demo-repo/ directory; path traversal attempts like ../../../etc/passwd are rejected File size limits: The read_file tool rejects files over 256 KB, preventing memory exhaustion attacks Recursion limits: Directory listing caps at 20 levels deep, preventing infinite traversal These constraints demonstrate responsible AI agent design. The agent has enough capability to do useful work but not enough to cause harm. When extending this project for your own use cases, maintain similar principles, grant minimum necessary permissions, validate all inputs, and fail closed on unexpected conditions. Running the Agent Getting the Local Repo Patch Agent running on your machine takes about five minutes. The project includes setup scripts that handle prerequisites automatically. Prerequisites Before running the setup, ensure you have: Node.js 18 or higher: Download from nodejs.org (LTS version recommended) Foundry Local: Install via winget install Microsoft.FoundryLocal (Windows) or brew install foundrylocal (macOS) GitHub Copilot CLI: Follow the GitHub Copilot CLI install guide Verify your installations: node --version # Should print v18.x.x or higher foundry --version copilot --version One-Command Setup The easiest path uses the provided setup scripts that install dependencies, start Foundry Local, and download the AI model: # Clone the repository git clone https://github.com/leestott/copilotsdk_foundrylocal.git cd copilotsdk_foundrylocal # Windows (PowerShell) .\setup.ps1 # macOS / Linux chmod +x setup.sh ./setup.sh When setup completes, you'll see: ━━━ Setup complete! ━━━ You're ready to go. Run one of these commands: npm run demo CLI agent (terminal output) npm run ui Web dashboard (http://localhost:3000) Manual Setup If you prefer step-by-step control: # Install npm packages npm install cd demo-repo && npm install --ignore-scripts && cd .. 
# Start Foundry Local and download the model foundry service start foundry model run qwen2.5-coder-1.5b # Copy environment configuration cp .env.example .env # Run the agent npm run demo The first model download takes a few minutes depending on your connection. After that, the model runs from cache with no internet required. Using the Web Dashboard For a visual experience with real-time streaming, launch the web UI: npm run ui Open http://localhost:3000 in your browser. The dashboard provides: Phase progress sidebar: Visual indication of which phase is running, completed, or errored Live streaming output: Model responses appear in real-time via WebSocket Tool call log: Every tool invocation logged with phase context Phase timing table: Performance metrics showing how long each phase took Environment info: Current model, endpoint, and repository path at a glance Configuration Options The agent supports several environment variables for customisation. Edit the .env file or set them directly: Variable Default Description FOUNDRY_LOCAL_ENDPOINT auto-detected Override the Foundry Local API endpoint FOUNDRY_LOCAL_API_KEY auto-detected Override the API key FOUNDRY_MODEL qwen2.5-coder-1.5b Which model to use from the Foundry Local catalog FOUNDRY_TIMEOUT_MS 180000 (3 min) How long each agent phase can run before timing out FOUNDRY_NO_PROXY — Set to 1 to disable the streaming proxy PORT 3000 Port for the web dashboard Using Different Models To try a different model from the Foundry Local catalog: # Use phi-3-mini instead FOUNDRY_MODEL=phi-3-mini npm run demo # Use a larger model for higher quality (requires more RAM/VRAM) FOUNDRY_MODEL=qwen2.5-7b npm run demo Adjusting for Slower Hardware If you're running on CPU-only or limited hardware, increase the timeout to give the model more time per phase: # 5 minutes per phase instead of 3 FOUNDRY_TIMEOUT_MS=300000 npm run demo Troubleshooting Common Issues When things don't work as expected, these solutions address the most common problems: Problem Solution foundry: command not found Install Foundry Local—see Prerequisites section copilot: command not found Install GitHub Copilot CLI—see Prerequisites section Agent times out on every phase Increase FOUNDRY_TIMEOUT_MS (e.g., 300000 for 5 min). CPU-only machines are slower. Port 3000 already in use Set PORT=3001 npm run ui Model download is slow First download can take 5-10 min. Subsequent runs use the cache. Cannot find module errors Run npm install again, then cd demo-repo && npm install --ignore-scripts Tests still fail after agent runs The agent edits files in demo-repo/. Reset with git checkout demo-repo/ and run again. PowerShell blocks setup.ps1 Run Set-ExecutionPolicy -Scope Process Bypass first, then .\setup.ps1 Diagnostic Test Scripts The src/tests/ folder contains standalone scripts for debugging SDK and Foundry Local integration issues. These are invaluable when things go wrong: # Debug-level SDK event logging npx tsx src/tests/test-debug.ts # Test non-streaming inference (bypasses streaming proxy) npx tsx src/tests/test-nostream.ts # Raw fetch to Foundry Local (bypasses SDK entirely) npx tsx src/tests/test-stream-direct.ts # Start the traffic-inspection proxy npx tsx src/tests/test-proxy.ts These scripts isolate different layers of the stack, helping identify whether issues lie in Foundry Local, the streaming proxy, the SDK, or your application code. 
Key Takeaways BYOK enables local-first AI: A single configuration object redirects the entire GitHub Copilot SDK to use on-device inference through Foundry Local Phased workflows improve reliability: Breaking complex tasks into PLAN → EDIT → VERIFY → SUMMARY phases makes agent behaviour predictable and debuggable Security requires intentional design: Allowlists, sandboxing, and size limits constrain agent capabilities to safe operations Local models have quirks: The streaming proxy and text-based tool detection demonstrate how to work around on-device model limitations Real-time feedback matters: The web dashboard with WebSocket streaming makes agent progress visible and builds trust in the system The architecture is extensible: Add new tools, change models, or modify phases to adapt the agent to your specific needs Conclusion and Next Steps The Local Repo Patch Agent proves that sophisticated agentic coding workflows don't require cloud infrastructure. By combining the GitHub Copilot SDK's orchestration capabilities with Foundry Local's on-device inference, you get intelligent code analysis that respects data sovereignty completely. The patterns demonstrated here, BYOK integration, phased execution, security sandboxing, and streaming workarounds, transfer directly to production systems. Consider extending this foundation with: Custom tool sets: Add database queries, API calls to internal services, or integration with your CI/CD pipeline Multiple repository support: Scan and fix issues across an entire codebase or monorepo Different model sizes: Use smaller models for quick scans, larger ones for complex refactoring Human-in-the-loop approval: Add review steps before applying fixes to production code Integration with Git workflows: Automatically create branches and PRs from agent-generated fixes Clone the repository, run through the demo, and start building your own local-first AI coding tools. The future of developer AI isn't just cloud—it's intelligent systems that run wherever your code lives. Resources Local Repo Patch Agent Repository – Full source code with setup scripts and documentation Foundry Local – Official site for on-device AI inference Foundry Local GitHub Repository – Installation instructions and CLI reference Foundry Local Get Started Guide – Official Microsoft Learn documentation Foundry Local SDK Reference – Python and JavaScript SDK documentation GitHub Copilot SDK – Official SDK repository GitHub Copilot SDK BYOK Documentation – Bring Your Own Key integration guide GitHub Copilot SDK Getting Started – SDK setup and first agent tutorial Microsoft Sample: Copilot SDK + Foundry Local – Official integration sample from Microsoft782Views0likes0CommentsBuilding with Azure OpenAI Sora: A Complete Guide to AI Video Generation
In this comprehensive guide, we'll explore how to integrate both Sora 1 and Sora 2 models from Azure OpenAI Service into a production web application. We'll cover API integration, request body parameters, cost analysis, limitations, and the key differences between using Azure AI Foundry endpoints versus OpenAI's native API. Table of Contents Introduction to Sora Models Azure AI Foundry vs. OpenAI API Structure API Integration: Request Body Parameters Video Generation Modes Cost Analysis per Generation Technical Limitations & Constraints Resolution & Duration Support Implementation Best Practices Introduction to Sora Models Sora is OpenAI's groundbreaking text-to-video model that generates realistic videos from natural language descriptions. Azure AI Foundry provides access to two versions: Sora 1: The original model focused primarily on text-to-video generation with extensive resolution options (480p to 1080p) and flexible duration (1-20 seconds) Sora 2: The enhanced version with native audio generation, multiple generation modes (text-to-video, image-to-video, video-to-video remix), but more constrained resolution options (720p only in public preview) Azure AI Foundry vs. OpenAI API Structure Key Architectural Differences Sora 1 uses Azure's traditional deployment-based API structure: Endpoint Pattern: https://{resource-name}.openai.azure.com/openai/deployments/{deployment-name}/... Parameters: Uses Azure-specific naming like n_seconds, n_variants, separate width/height fields Job Management: Uses /jobs/{id} for status polling Content Download: Uses /video/generations/{generation_id}/content/video Sora 2 adapts OpenAI's v1 API format while still being hosted on Azure: Endpoint Pattern: https://{resource-name}.openai.azure.com/openai/deployments/{deployment-name}/videos Parameters: Uses OpenAI-style naming like seconds (string), size (combined dimension string like "1280x720") Job Management: Uses /videos/{video_id} for status polling Content Download: Uses /videos/{video_id}/content Why This Matters? 
This architectural difference requires conditional request formatting in your code: const isSora2 = deployment.toLowerCase().includes('sora-2'); if (isSora2) { requestBody = { model: deployment, prompt, size: `${width}x${height}`, // Combined format seconds: duration.toString(), // String type }; } else { requestBody = { model: deployment, prompt, height, // Separate dimensions width, n_seconds: duration.toString(), // Azure naming n_variants: variants, }; } API Integration: Request Body Parameters Sora 1 API Parameters Standard Text-to-Video Request: { "model": "sora-1", "prompt": "Wide shot of a child flying a red kite in a grassy park, golden hour sunlight, camera slowly pans upward.", "height": "720", "width": "1280", "n_seconds": "12", "n_variants": "2" } Parameter Details: model (String, Required): Your Azure deployment name prompt (String, Required): Natural language description of the video (max 32000 chars) height (String, Required): Video height in pixels width (String, Required): Video width in pixels n_seconds (String, Required): Duration (1-20 seconds) n_variants (String, Optional): Number of variations to generate (1-4, constrained by resolution) Sora 2 API Parameters Text-to-Video Request: { "model": "sora-2", "prompt": "A serene mountain landscape with cascading waterfalls, cinematic drone shot", "size": "1280x720", "seconds": "12" } Image-to-Video Request (uses FormData): const formData = new FormData(); formData.append('model', 'sora-2'); formData.append('prompt', 'Animate this image with gentle wind movement'); formData.append('size', '1280x720'); formData.append('seconds', '8'); formData.append('input_reference', imageFile); // JPEG/PNG/WebP Video-to-Video Remix Request: Endpoint: POST .../videos/{video_id}/remix Body: Only { "prompt": "your new description" } The original video's structure, motion, and framing are reused while applying the new prompt Parameter Details: model (String, Optional): Your deployment name prompt (String, Required): Video description size (String, Optional): Either "720x1280" or "1280x720" (defaults to "720x1280") seconds (String, Optional): "4", "8", or "12" (defaults to "4") input_reference (File, Optional): Reference image for image-to-video mode remix_video_id (String, URL parameter): ID of video to remix Video Generation Modes 1. Text-to-Video (Both Models) The foundational mode where you provide a text prompt describing the desired video. Implementation: const response = await fetch(endpoint, { method: 'POST', headers: { 'Content-Type': 'application/json', 'api-key': apiKey, }, body: JSON.stringify({ model: deployment, prompt: "A train journey through mountains with dramatic lighting", size: "1280x720", seconds: "12", }), }); Best Practices: Include shot type (wide, close-up, aerial) Describe subject, action, and environment Specify lighting conditions (golden hour, dramatic, soft) Add camera movement if desired (pans, tilts, tracking shots) 2. Image-to-Video (Sora 2 Only) Generate a video anchored to or starting from a reference image. Key Requirements: Supported formats: JPEG, PNG, WebP Image dimensions must exactly match the selected video resolution Our implementation automatically resizes uploaded images to match Implementation Detail: // Resize image to match video dimensions const targetWidth = parseInt(width); const targetHeight = parseInt(height); const resizedImage = await resizeImage(inputReference, targetWidth, targetHeight); // Send as multipart/form-data formData.append('input_reference', resizedImage); 3. 
Video-to-Video Remix (Sora 2 Only) Create variations of existing videos while preserving their structure and motion. Use Cases: Change weather conditions in the same scene Modify time of day while keeping camera movement Swap subjects while maintaining composition Adjust artistic style or color grading Endpoint Structure: POST {base_url}/videos/{original_video_id}/remix?api-version=2024-08-01-preview Implementation: let requestEndpoint = endpoint; if (isSora2 && remixVideoId) { const [baseUrl, queryParams] = endpoint.split('?'); const root = baseUrl.replace(/\/videos$/, ''); requestEndpoint = `${root}/videos/${remixVideoId}/remix${queryParams ? '?' + queryParams : ''}`; } Cost Analysis per Generation Sora 1 Pricing Model Base Rate: ~$0.05 per second per variant at 720p Resolution Scaling: Cost scales linearly with pixel count Formula: const basePrice = 0.05; const basePixels = 1280 * 720; // Reference resolution const currentPixels = width * height; const resolutionMultiplier = currentPixels / basePixels; const totalCost = basePrice * duration * variants * resolutionMultiplier; Examples: 720p (1280×720), 12 seconds, 1 variant: $0.60 1080p (1920×1080), 12 seconds, 1 variant: $1.35 720p, 12 seconds, 2 variants: $1.20 Sora 2 Pricing Model Flat Rate: $0.10 per second per variant (no resolution scaling in public preview) Formula: const totalCost = 0.10 * duration * variants; Examples: 720p (1280×720), 4 seconds: $0.40 720p (1280×720), 12 seconds: $1.20 720p (720×1280), 8 seconds: $0.80 Note: Since Sora 2 currently only supports 720p in public preview, resolution doesn't affect cost, only duration matters. Cost Comparison Scenario Sora 1 (720p) Sora 2 (720p) Winner 4s video $0.20 $0.40 Sora 1 12s video $0.60 $1.20 Sora 1 12s + audio N/A (no audio) $1.20 Sora 2 (unique) Image-to-video N/A $0.40-$1.20 Sora 2 (unique) Recommendation: Use Sora 1 for cost-effective silent videos at various resolutions. Use Sora 2 when you need audio, image/video inputs, or remix capabilities. Technical Limitations & Constraints Sora 1 Limitations Resolution Options: 9 supported resolutions from 480×480 to 1920×1080 Includes square, portrait, and landscape formats Full list: 480×480, 480×854, 854×480, 720×720, 720×1280, 1280×720, 1080×1080, 1080×1920, 1920×1080 Duration: Flexible: 1 to 20 seconds Any integer value within range Variants: Depends on resolution: 1080p: Variants disabled (n_variants must be 1) 720p: Max 2 variants Other resolutions: Max 4 variants Concurrent Jobs: Maximum 2 jobs running simultaneously Job Expiration: Videos expire 24 hours after generation Audio: No audio generation (silent videos only) Sora 2 Limitations Resolution Options (Public Preview): Only 2 options: 720×1280 (portrait) or 1280×720 (landscape) No square formats No 1080p support in current preview Duration: Fixed options only: 4, 8, or 12 seconds No custom durations Defaults to 4 seconds if not specified Variants: Not prominently supported in current API documentation Focus is on single high-quality generations with audio Concurrent Jobs: Maximum 2 jobs (same as Sora 1) Job Expiration: 24 hours (same as Sora 1) Audio: Native audio generation included (dialogue, sound effects, ambience) Shared Constraints Concurrent Processing: Both models enforce a limit of 2 concurrent video jobs per Azure resource. You must wait for one job to complete before starting a third. Job Lifecycle: queued → preprocessing → processing/running → completed Download Window: Videos are available for 24 hours after completion. 
After expiration, you must regenerate the video. Generation Time: Typical: 1-5 minutes depending on resolution, duration, and API load Can occasionally take longer during high demand Resolution & Duration Support Matrix Sora 1 Support Matrix Resolution Aspect Ratio Max Variants Duration Range Use Case 480×480 Square 4 1-20s Social thumbnails 480×854 Portrait 4 1-20s Mobile stories 854×480 Landscape 4 1-20s Quick previews 720×720 Square 4 1-20s Instagram posts 720×1280 Portrait 2 1-20s TikTok/Reels 1280×720 Landscape 2 1-20s YouTube shorts 1080×1080 Square 1 1-20s Premium social 1080×1920 Portrait 1 1-20s Premium vertical 1920×1080 Landscape 1 1-20s Full HD content Sora 2 Support Matrix Resolution Aspect Ratio Duration Options Audio Generation Modes 720×1280 Portrait 4s, 8s, 12s ✅ Yes Text, Image, Video Remix 1280×720 Landscape 4s, 8s, 12s ✅ Yes Text, Image, Video Remix Note: Sora 2's limited resolution options in public preview are expected to expand in future releases. Implementation Best Practices 1. Job Status Polling Strategy Implement adaptive backoff to avoid overwhelming the API: const maxAttempts = 180; // 15 minutes max let attempts = 0; const baseDelayMs = 3000; // Start with 3 seconds while (attempts < maxAttempts) { const response = await fetch(statusUrl, { headers: { 'api-key': apiKey }, }); if (response.status === 404) { // Job not ready yet, wait longer const delayMs = Math.min(15000, baseDelayMs + attempts * 1000); await new Promise(r => setTimeout(r, delayMs)); attempts++; continue; } const job = await response.json(); // Check completion (different status values for Sora 1 vs 2) const isCompleted = isSora2 ? job.status === 'completed' : job.status === 'succeeded'; if (isCompleted) break; // Adaptive backoff const delayMs = Math.min(15000, baseDelayMs + attempts * 1000); await new Promise(r => setTimeout(r, delayMs)); attempts++; } 2. Handling Different Response Structures Sora 1 Video Download: const generations = Array.isArray(job.generations) ? job.generations : []; const genId = generations[0]?.id; const videoUrl = `${root}/${genId}/content/video`; Sora 2 Video Download: const videoUrl = `${root}/videos/${jobId}/content`; 3. Error Handling try { const response = await fetch(endpoint, fetchOptions); if (!response.ok) { const error = await response.text(); throw new Error(`Video generation failed: ${error}`); } // ... handle successful response } catch (error) { console.error('[VideoGen] Error:', error); // Implement retry logic or user notification } 4. Image Preprocessing for Image-to-Video Always resize images to match the target video resolution: async function resizeImage(file: File, targetWidth: number, targetHeight: number): Promise<File> { return new Promise((resolve, reject) => { const img = new Image(); const canvas = document.createElement('canvas'); const ctx = canvas.getContext('2d'); img.onload = () => { canvas.width = targetWidth; canvas.height = targetHeight; ctx.drawImage(img, 0, 0, targetWidth, targetHeight); canvas.toBlob((blob) => { if (blob) { const resizedFile = new File([blob], file.name, { type: file.type }); resolve(resizedFile); } else { reject(new Error('Failed to create resized image blob')); } }, file.type); }; img.onerror = () => reject(new Error('Failed to load image')); img.src = URL.createObjectURL(file); }); } 5. 
Cost Tracking Implement cost estimation before generation and tracking after: // Pre-generation estimate const estimatedCost = calculateCost(width, height, duration, variants, soraVersion); // Save generation record await saveGenerationRecord({ prompt, soraModel: soraVersion, duration: parseInt(duration), resolution: `${width}x${height}`, variants: parseInt(variants), generationMode: mode, estimatedCost, status: 'queued', jobId: job.id, }); // Update after completion await updateGenerationStatus(jobId, 'completed', { videoId: finalVideoId }); 6. Progressive User Feedback Provide detailed status updates during the generation process: const statusMessages: Record<string, string> = { 'preprocessing': 'Preprocessing your request...', 'running': 'Generating video...', 'processing': 'Processing video...', 'queued': 'Job queued...', 'in_progress': 'Generating video...', }; onProgress?.(statusMessages[job.status] || `Status: ${job.status}`); Conclusion Building with Azure OpenAI's Sora models requires understanding the nuanced differences between Sora 1 and Sora 2, both in API structure and capabilities. Key takeaways: Choose the right model: Sora 1 for resolution flexibility and cost-effectiveness; Sora 2 for audio, image inputs, and remix capabilities Handle API differences: Implement conditional logic for parameter formatting and status polling based on model version Respect limitations: Plan around concurrent job limits, resolution constraints, and 24-hour expiration windows Optimize costs: Calculate estimates upfront and track actual usage for better budget management Provide great UX: Implement adaptive polling, progressive status updates, and clear error messages The future of AI video generation is exciting, and Azure AI Foundry provides production-ready access to these powerful models. As Sora 2 matures and limitations are lifted (especially resolution options), we'll see even more creative applications emerge. Resources: Azure AI Foundry Sora Documentation OpenAI Sora API Reference Azure OpenAI Service Pricing This blog post is based on real-world implementation experience building LemonGrab, my AI video generation platform that integrates both Sora 1 and Sora 2 through Azure AI Foundry. The code examples are extracted from production usage.502Views0likes0CommentsThe Hidden Memory Architecture of LLMs
Your LLM is not running out of intelligence. It is often hitting context and runtime memory limits. I’m Hazem Ali — Microsoft AI MVP, Distinguished AI and ML Engineer / Architect, and Founder and CEO of Skytells. I’ve built and led engineering work that turns deep learning research into production systems that survive real-world constraints. I speak at major conferences and technical communities, and I regularly deliver deep technical sessions on enterprise AI and agent architectures. If there’s one thing you’ll notice about me, it’s that I’m drawn to the deepest layers of engineering, the parts most teams only discover when systems are under real pressure. My specialization spans the full AI stack, from deep learning and system design to enterprise architecture and security. One of the most distinctive parts of that work lives in the layer most people don’t see in demos: inference runtimes, memory and KV-cache behavior, serving architecture, observability, and zero-trust governance. So this article is written from that lens: translating “unexpected LLM behavior” into engineering controls you can measure, verify, and enforce. I’ll share lessons learned and practical guidance based on my experience. Where latency is percentiles, not averages. Where concurrency is real. Where cost has a curve. Where one bad assumption turns into an incident. That is why I keep repeating a simple point across my writing. When AI fails in production, it usually isn’t because the model is weak. It is because the architecture around it was never built for real conditions. I wrote about that directly in AI Didn’t Break Your Production, Your Architecture Did. If you have not read it yet, it will give you the framing. This article goes one layer deeper, So, think of this as an engineering deep-dive grounded in published systems work. Because the subsystem that quietly decides whether your GenAI stays stable under pressure is memory. Not memory as a buzzword. Memory as the actual engineering stack you are shipping: prefill and decode behavior, KV cache growth, attention budgets, paging and fragmentation, prefix reuse, retrieval tiers, cache invalidation, and the trust boundaries that decide what is allowed into context and what is not. That stack decides time to first token, tokens per second, throughput, tail latency, and cost per request. It also decides something people rarely connect to architecture: whether the agent keeps following constraints after a long session, or slowly drifts because the constraints fell out of the effective context. If you have watched a solid agent become unreliable after a long conversation, you have seen this. If you have watched a GPU sit at low utilization while tokens stream slowly, you have seen this. If you increased context length and your bill jumped while quality did not, you have seen this. So here is the goal of this piece. Turn the hidden memory mechanics of LLMs into something you can design, measure, and defend. Not just vaguely understand. Let’s break it down. A quick grounding: What evolved, and what did not! The modern LLM wave rides on the Transformer architecture introduced in Attention Is All You Need. What changed since then is not the core idea of attention. What changed is the engineering around it: kernels got smarter about memory movement inference got separated into phases and pipelines KV cache went from a tensor to an allocator problem serving systems started looking like OS schedulers So yes, the model evolved. 
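Before going further, it helps to put a number on why the KV cache turned into an allocator problem. Here is a back-of-envelope sketch; the model shape below is an illustrative assumption (a generic ~7B-class decoder with full multi-head KV), not any specific model's published spec, and the formula is the standard per-token KV size: 2 (K and V) × layers × KV heads × head dimension × bytes per element.

```typescript
// Back-of-envelope KV cache sizing. All model-shape numbers are illustrative
// assumptions for a generic ~7B-class decoder, not a specific model's specs.
interface ModelShape {
  layers: number;
  kvHeads: number;      // heads that actually store K/V (GQA can shrink this)
  headDim: number;
  bytesPerElem: number; // 2 for FP16/BF16, 1 for FP8
}

function kvBytesPerToken(m: ModelShape): number {
  // The factor of 2 accounts for storing both K and V at every layer.
  return 2 * m.layers * m.kvHeads * m.headDim * m.bytesPerElem;
}

function kvBytesTotal(m: ModelShape, tokensPerSeq: number, concurrentSeqs: number): number {
  return kvBytesPerToken(m) * tokensPerSeq * concurrentSeqs;
}

const shape: ModelShape = { layers: 32, kvHeads: 32, headDim: 128, bytesPerElem: 2 };
const gib = (b: number) => (b / 1024 ** 3).toFixed(2) + " GiB";

// One long conversation: 32k tokens of context for a single request.
console.log("1 seq  x 32k tokens:", gib(kvBytesTotal(shape, 32_000, 1)));
// The same context length, but 16 concurrent requests in the batch.
console.log("16 seq x 32k tokens:", gib(kvBytesTotal(shape, 32_000, 16)));
```

At roughly 0.5 MB of KV state per token under these assumptions, a single 32k-token conversation wants about 16 GB of GPU memory before you admit a second user.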
But the deeper truth is this: LLM performance is now strongly shaped by memory behavior, not just FLOPs. That is not a vibe. It is why whole research lines exist around IO-aware attention and KV cache management. A Story from CognitionX 2025 This happened live at CognitionX Dubai Conference 2025 Most CognitionX events are community-focused on engineering-first learning, turning modern AI and cloud capabilities, including Microsoft technologies, into practical systems people can build, measure, and operate, bringing together Microsoft MVPs and practitioners to share proven patterns and hands-on best practices. I wanted to land a point in a way engineers can’t unsee.. GenAI performance is often constrained by the serving system (memory, bandwidth, scheduling, batching, and initialization paths) before it is constrained by model quality. So I ran a live demo on an NVIDIA A100 80GB instance. Before anything, we intentionally warmed the runtime. The very first request on a fresh process or fresh GPU context can include one-time overhead that is not representative of steady-state inference things like model weight loading, CUDA context creation, kernel/module initialization, allocator warm-up, and framework-level graph/runtime setup. I didn’t want the audience to confuse “first-request overhead” with actual steady-state behavior. Then I started with a clean run: a short input, fast output, stable behavior. This is what most demos show: a model that looks powerful and responsive when prompt length is small, concurrency is low, and runtime state is minimal. > After that, I changed one variable on purpose. I kept adding constraints and context exactly the way real users do: more requirements, more follow-ups, more iterations back to back. Same model, same serving stack, same GPU. The only thing that changed was the amount of context being processed and retained by the runtime across tokens, which increases memory pressure and reduces scheduling flexibility. You could see the system react in measurable ways. As context grew and request patterns became less predictable, end-to-end latency increased and sustained throughput dropped, and the available memory headroom tightened. Nothing “mystical” happened to the model. We simply pushed the serving system into a regime where it was more constrained by memory footprint, memory bandwidth, batching efficiency, and scheduler behavior than by raw compute. Then I connected it directly to LLM inference mechanics. Text generation follows the same pattern, except the dominant runtime state has a name: the KV cache. Findings During prefill, the model processes the full prompt to initialize attention state and populate the KV cache. During decode, that state is reused and extended one token at a time. KV cache memory grows linearly with sequence length per request, and it also scales with the number of concurrent sequences and with model configuration details such as number of layers, number of attention heads, head dimension, and dtype (FP16/BF16/FP8, etc.). As prompt length and concurrency increase, the serving bottleneck often shifts from pure compute to system-level constraints: HBM bandwidth and access patterns, KV residency and paging behavior, allocator efficiency and fragmentation, and batching and scheduling dynamics. That is the mental model behind the rest of this article. The mental model that fixes most confusion LLM inference is the runtime forward pass where the model turns input tokens into a probability distribution for the next token. 
It runs in two phases: prefill (process the whole prompt once and build KV cache) then decode (generate tokens one-by-one while reusing KV cache). Performance and stability are dominated by context limits + KV cache memory/bandwidth, not just compute. The key is that inference is not one big compute job. It is one prompt pass, then many per-token passes. Prefill builds reusable state. Decode reuses and extends it, token by token, while repeatedly reading KV cache. Once you see it this way, production behavior becomes predictable, especially why long context and high concurrency change throughput and tail latency. LLM inference has two phases Prefill You process the full prompt tokens in parallel, and you create the KV cache. Decode You generate tokens autoregressively, one token at a time, reusing the KV cache. Now the first real punchline: Prefill is compute heavy. Decode is memory hungry. Decode reuses prior keys and values, which means you are constantly reading KV cache from GPU memory. That is why decode often becomes memory-bandwidth bound and tends to underutilize GPU compute. So when people ask why the GPU looks bored while tokens are slowly streaming, the answer is usually: Because decode is waiting on memory. Each generated token forces the model to pull past keys and values from KV cache, layer by layer, from GPU memory. So even if your GPU has plenty of compute left, throughput can stall on memory bandwidth and memory access patterns. KV cache is not an optimization. It is the runtime state In a Transformer decoder, each layer produces keys and values per token. If you had to recompute those for every new token, latency would explode. So we cache K and V. That cache grows with sequence length. That is the KV cache, Now here is the engineering detail that matters more than most people admit: The KV cache is one of the largest pieces of mutable state in LLM inference. And it is dynamic. It grows per request, per turn, per decoding strategy. This is exactly the problem statement that the vLLM PagedAttention paper attacks (arXiv) High-throughput serving needs batching, but KV cache memory becomes huge and changes shape dynamically, and naive management wastes memory through fragmentation and duplication. Why this starts behaving like distributed memory Well, A single GPU can only hold so much. At scale, you do all the usual tricks: batching continuous batching kv reuse prefix caching paging speculative decoding sharding multi GPU scheduling And once you do that, your system starts looking like a memory manager. Not metaphorically. Literally. The constraint isn’t just weights, it’s live KV cache, which grows with tokens and concurrency. So serving becomes memory admission control, can you accept this request without blowing the KV budget and collapsing batch size? PagedAttention explicitly takes the OS route: Paging KV into fixed-size blocks to avoid fragmentation and keep packing/batching stable under churn. (arXiv) That is not blog language. That is the core design. So if you want a rare angle that most people cannot talk about, here it is: GenAI serving is OS design wearing a Transformer costume. It means the hardest production problems stop being attention math and become OS problems: admission control, paging/fragmentation, scheduling (prefill vs decode), and isolation for shared caches. Paging: the KV cache allocator is the hidden bottleneck Paging shows up when you stop pretending every request has a clean, contiguous memory layout. Real traffic creates fragmentation. 
Variable length sequences create uneven allocations. And once you batch requests, wasted KV memory becomes lost throughput. Let’s get concrete. The classical failure mode: fragmentation If you allocate KV cache as big contiguous tensors per request, two things happen: you over allocate to plan for worst case length you fragment memory as requests come and go PagedAttention addresses this by storing KV cache in non contiguous blocks allocated on demand, eliminating external fragmentation by making blocks uniform, and reducing internal fragmentation by using smaller blocks. The vLLM paper also claims near zero waste in KV cache memory with this approach, and reports 2 to 4 times throughput improvements compared to prior systems in its evaluation. If you are building your own serving stack and you do not understand your KV allocator, you are basically shipping an OS with malloc bugs and hoping Kubernetes fixes it. It will not. Attention Budgets: The real meaning of context limits Context window is often marketed like a feature. In production it behaves like a budget that you spend. > Spend it on the wrong tokens and quality drops. > Spend too much of it and performance collapses under concurrency. Most people talk about context window like it is a product feature. Engineers should talk about it like this: Context is an attention budget with quadratic pressure. The FlashAttention paper opens with the key fact: Transformers get slow and memory hungry on long sequences because self-attention has quadratic time and memory complexity in sequence length. That pressure shows up in two places: Attention compute and intermediate memory Naive attention wants to touch (and often materialize) an N×N attention structure. As N grows, the cost curve explodes. KV cache is linear in size, but decode bandwidth scales with length KV cache grows with tokens (O(n)), but during decode every new token repeatedly reads more past KV. Longer contexts mean more memory traffic per token and higher tail-latency risk under load. FlashAttention exists because naive attention spends too much time moving data between HBM and SRAM, so it uses tiling to reduce HBM reads/writes and avoids materializing the full attention matrix. So when you choose longer contexts, you are not choosing more text. You are choosing: more KV cache to store more memory bandwidth pressure during decode more IO pressure inside attention kernels more tail latency risk under concurrency This is why context length is not a free upgrade. It is an architectural trade. Prefill decode disaggregation: when memory becomes a network problem Prefill–decode disaggregation is when you run the prefill phase on one GPU/node, then ship the resulting KV cache (or a reference to it) to a different GPU/node that runs the decode phase. So instead of one engine doing prefill → decode end-to-end, you split inference into two stages with a KV transfer boundary in the middle. The reason people do it: prefill is typically compute/throughput-oriented, while decode is latency + memory-bandwidth-oriented, so separating them lets you size and schedule hardware differently, but it turns KV into distributed state you must move, track, and retire safely. Once you treat prefill and decode as different phases, the next question is obvious: > Should they run on the same device? In many systems the answer becomes no, because the resource profiles differ. 
But the moment you split them, KV cache becomes a transferable object and decode is now gated by network tail latency as much as GPU speed. Some systems split them so prefill happens on one engine and decode on another. This is literally called prefill decode disaggregation, and technical reports describe it as splitting inference into a prefill stage and a decode stage across different GPUs or nodes, including cross-engine KV cache transfer. Now you have a new engineering reality: The KV cache becomes a distributed object. That means you inherit distributed systems issues: serialization / layout choices transfer overhead and tail latency correctness: ordering, cancellation, retries, duplication, versioning admission control under congestion / backpressure isolation between tenants If you are reading this as a CTO or SRE, this is the part you should care about. Because this is where systems die in production. Consistency: what it even means for KV cache Consistency is not a buzzword here, It is the difference between safe reuse and silent corruption. When you reuse KV state, you are reusing computation under assumptions. If those assumptions are wrong, you may get fast answers that are simply not equivalent to running the model from scratch. Let’s define terms carefully, In classic distributed systems, consistency is about agreement on state. In LLM serving, KV cache consistency usually means these constraints: Causal alignment The KV cache you reuse must correspond exactly to the same prefix tokens (same token IDs, same order, same positions) the model already processed. Parameter + configuration alignment KV computed under one model snapshot/config must not be reused under another: different weights, tokenizer, RoPE/positioning behavior, quantization/dtype, or other model-level settings can invalidate equivalence. Conditioning alignment If the prompt includes more than text (multimodal inputs, system/tool metadata), the cache key must include all conditioning inputs, Otherwise “same text prefix” can still be a different request. (This is a real-world footgun in practice.) This is why prefix caching is implemented as caching KV blocks for processed prefixes and reusing them only when a new request shares the same prefix, so it can skip computation of the shared part. And the vLLM docs make an explicit claim: prefix caching is widely used, is “almost a free lunch,” and does not change model outputs when the prefix matches. The moment you relax the prefix equality rule, you are not caching. You are approximating. That is a different system. So here is the consistency rule that matters: Only reuse KV state when you can prove token identity, not intent similarity. Performance without proof is just corruption with low latency. — Hazem Ali So my recommendation, treat KV reuse as a correctness feature first, not a speed feature. Cache only when you can prove token identity, and label anything else as approximation with explicit guardrails. Multi-tenancy: The memory security problem nobody wants to own Most senior engineers avoid this layer because it’s as unforgiving as memory itself, and I get why even principals miss it. This is deep-systems territory, where correctness is invisible until it breaks. However, let me break it down and make it easy for you to reason about. Memory is not only a performance layer, It is also a security surface. Yes, you read that right. Memory is not only a performance layer. It is also a security surface. 
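Here is a minimal sketch of what "prove token identity" plus tenant scoping can look like as a cache key. Everything in it is an illustrative assumption, the field names, the hash choice, the tenant field, none of it is a specific serving framework's API; the point is that the key covers token IDs, model snapshot and numeric configuration, any extra conditioning, and the tenant, so a hit is only possible when reuse is actually safe.

```typescript
import { createHash } from "node:crypto";

// Illustrative cache-key construction for prefix/KV reuse. The rule it encodes:
// a cache hit must prove token identity AND configuration identity, and must
// never cross a tenant boundary.
interface PrefixKeyInput {
  tenantId: string;            // isolation boundary: never share across tenants
  tokenIds: number[];          // exact prefix token IDs, not "similar text"
  modelSnapshot: string;       // weights revision / deployment version
  tokenizerRevision: string;
  dtype: "fp16" | "bf16" | "fp8" | "int8";
  positioning: string;         // e.g. RoPE/scaling config serialized as a string
  extraConditioning?: string;  // multimodal refs, tool/system metadata, etc.
}

export function prefixCacheKey(input: PrefixKeyInput): string {
  const h = createHash("sha256");
  // Serialize every field that can change the computed K/V.
  h.update(input.tenantId);
  h.update("|" + input.modelSnapshot);
  h.update("|" + input.tokenizerRevision);
  h.update("|" + input.dtype);
  h.update("|" + input.positioning);
  h.update("|" + (input.extraConditioning ?? ""));
  h.update("|" + input.tokenIds.join(","));
  return h.digest("hex");
}
```

If any field differs, the key differs and the request simply recomputes its prefix; a near-match is never a hit, which is exactly the fail-closed behavior the consistency rule above demands.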
I remember my session at AICO Dubai 2025, where the whole point was Zero-Trust Architecture. What most teams miss is that the exact same Zero-Trust logic applies one layer deeper, at the memory level as well. Once you batch users, cache prefixes, and reuse state, you are operating a multi-tenant platform whether you admit it or not. That means isolation and scope become first-class design constraints. If you ignore this, performance optimizations become exposure risks. Now we get to the part most GenAI articles avoid. If your serving layer does any form of cross-request reuse, batching, or shared caches, then you have a trust boundary issue. The boundary isn’t just the model. It is the serving stack: the scheduler, the cache namespace, the debug surface, and the logs. User → serving → tenant-scoped cache → tools/data. Performance wants sharing; security demands scoping. In my Zero-Trust agent article, I framed the mindset clearly: do not trust the user, the model, the tools, the internet, or the documents you ground on, and any meaningful action must have identity, explicit permissions, policy checks outside the prompt, and observability. That same mindset applies here. Because KV cache can become a leakage channel if you get sloppy: cross-tenant prefix caching without strict scoping and cache key namespaces shared batch scheduling that can leak metadata through timing and resource signals debug endpoints that expose tokenization details or cache keys logs that accidentally store prompts, prefixes, or identifiers I am not claiming a specific CVE here, I am stating the architectural risk class. And the mitigation is the same pattern I already published: Once an agent can call tools that mutate state, treat it like a privileged service, not a chatbot. - Hazem Ali I would extend that line to serving, Once your inference stack shares memory state across users, treat it like a multi-tenant platform, not a demo endpoint. Speculative decoding: latency tricks that still depend on memory Speculative decoding is a clean example of a pattern you’ll keep seeing. A lot of speedups aren’t about changing the model at all. They’re about changing how you schedule work and how you validate tokens. Speculative decoding flow. A draft model proposes N tokens; the target model verifies them in parallel; accepted tokens are committed and extend KV; rejected tokens fall back to standard decode. But even when you make decode faster, you still pay the memory bill: KV reads, KV writes, and state that keeps growing. Speculative decoding is one of the most practical ways to speed up decode without touching the target model. The idea is simple: a smaller draft model proposes N tokens, then the larger target model verifies them in parallel. If most of them get accepted, you effectively get multiple tokens per expensive target-model step, while still matching the target distribution. It helps, but it doesn’t make memory go away: verification still has to attend over the current prefix and work against KV state acceptance rate is everything: poor alignment means more rejections and less real gain batching and scheduler details matter a lot in production (ragged acceptance, bookkeeping, and alignment rules can change the outcome) Figure 12B, Speedup vs acceptance rate (and the memory floor). Higher acceptance drives real gains, but KV reads/writes and state growth remain a bandwidth floor that doesn’t disappear. So speculative decoding isn’t magic. 😅 It’s a scheduling + memory strategy dressed as an algorithm. 
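To see why acceptance rate is everything, here is a small sketch using the simplified independent-acceptance estimate from the speculative decoding literature: with k draft tokens per step and per-token acceptance rate a, the expected tokens committed per expensive target-model step is roughly (1 − a^(k+1)) / (1 − a). The numbers below are illustrative, not benchmarks of any particular draft/target pair.

```typescript
// Expected tokens committed per target-model step under speculative decoding,
// using the simplified i.i.d.-acceptance estimate (1 - a^(k+1)) / (1 - a),
// where k = draft tokens proposed per step and a = per-token acceptance rate.
function expectedTokensPerStep(k: number, acceptanceRate: number): number {
  const a = acceptanceRate;
  if (a >= 1) return k + 1; // every draft token accepted, plus the bonus token
  return (1 - Math.pow(a, k + 1)) / (1 - a);
}

const k = 4; // draft tokens per step
for (const a of [0.3, 0.5, 0.7, 0.9]) {
  console.log(
    `acceptance ${a}: ~${expectedTokensPerStep(k, a).toFixed(2)} tokens per expensive step`
  );
}
// Low acceptance (~0.3) barely beats one token per step once you add draft-model
// and verification overhead; high acceptance (~0.9) is where the real gain lives.
```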
If you turn it on, benchmark it under your actual workload. Even practical inference guides call out that results depend heavily on draft/target alignment and acceptance rate: you measure it, you don’t assume it. Azure: Why it matters here Azure matters here for one reason: it gives you production control points that map directly to the failure modes we’ve been talking about: memory pressure, batching behavior, cache scope, isolation, and ops. Not because you can buy a bigger GPU. Because in production, survivability comes from control points. 1. Foundry Agent Service as a governed agent surface The point isn’t agents as a feature. The point is that orchestration changes memory patterns and operational risk. According to the product documentation, Foundry Agent Service is positioned as a platform to design, deploy, and scale agents, with built-in integration to knowledge sources (e.g., Bing, SharePoint, Fabric, Azure AI Search) and a large action surface via Logic Apps connectors. Why that matters in this article: once you add tools + retrieval + multi-step execution, you amplify token volume and state. 2. Tools + grounding primitives you can actually audit Grounding is not free. It expands context, increases prefill cost, and changes what you carry into decode. According to the latest documentation, Foundry’s tools model explicitly separates knowledge tools and public web grounding. That separation is operationally important: it gives you clearer “what entered the context” boundaries, so when quality drifts, you can debug whether it’s retrieval/grounding vs serving/memory. 3. AKS + MIG: when KV cache becomes a deployment decision GPU utilization isn’t just “do we have GPUs?” It’s tenancy, isolation, and throughput under hard memory budgets. According to AKS Docs, Azure AKS supports Multi-Instance GPU (MIG), where supported NVIDIA GPUs can be partitioned into multiple smaller GPU instances, each with its own compute slices and memory. That turns KV cache headroom from a runtime detail into a deployment constraint. This is exactly where the KV cache framing becomes useful: Smaller MIG slices mean tighter KV cache budgets Batching must respect per-slice memory headroom Paging and prefix caching become more important You are effectively right-sizing memory domains 4. Managed GPU nodes: reducing the ops entropy around inference A lot of production pain lives around the model: drivers, plugins, telemetry, node lifecycle. As documented, AKS now supports fully managed GPU nodes (preview) that install the NVIDIA driver, device plugin, and DCGM metrics exporter by default, reducing the moving parts in the layer that serves your KV-heavy workloads. Architectural Design: AI as Distributed Memory on Azure Now we get to the interesting part: turning the ideas into a blueprint you can actually implement. The goal is simple: keep the control plane and data plane clean, and treat memory as a first-class layer. If you do that, scaling becomes a deliberate engineering exercise instead of a firefight. The moment you treat inference as a multi-tenant memory system, not a model endpoint, you stop chasing incidents and start designing control. — Hazem Ali Control plane: The Governance Unit Use Foundry Hubs/Projects as the governance boundary: a place to group agents, model deployments, tools, and access control so RBAC, policies, and monitoring attach to a single unit of ownership. Then enforce identity + least privilege for any tool calls outside the prompt, aligned with your zero-trust framing.
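To make "identity + least privilege for any tool calls outside the prompt" concrete, here is a minimal sketch of a policy gate that sits between the agent and tool execution. The tool names, scopes, and shape of the policy table are illustrative assumptions, not a specific Azure or Foundry API; the point is that the decision is made in your code, from verified identity and explicit policy, no matter what the model asked for.

```typescript
// Minimal zero-trust gate for agent tool calls. Names and scopes are
// illustrative assumptions, not a platform API. The decision runs outside
// the prompt, on verified identity, and fails closed.
interface CallerIdentity {
  tenantId: string;
  principalId: string;
  scopes: Set<string>;       // granted by your identity provider / RBAC
}

interface ToolCallRequest {
  tool: string;              // e.g. "search_index", "create_ticket"
  mutatesState: boolean;     // writes need stricter treatment than reads
  args: Record<string, unknown>;
}

const REQUIRED_SCOPE: Record<string, string> = {
  search_index: "knowledge.read",
  create_ticket: "tickets.write",
};

export function authorizeToolCall(id: CallerIdentity, req: ToolCallRequest): void {
  const needed = REQUIRED_SCOPE[req.tool];
  if (!needed) throw new Error(`Unknown tool: ${req.tool}`);              // fail closed
  if (!id.scopes.has(needed)) throw new Error(`Missing scope ${needed}`); // least privilege
  if (req.mutatesState && !id.scopes.has("agent.mutate")) {
    throw new Error("State-changing calls require explicit mutate permission");
  }
  // In a real system you would also log the decision for audit and observability.
}
```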
Data plane: Where tokens turn into latency Pick one of two concrete paths: Option A: Managed models + managed orchestration Use Foundry Models / model catalog with Foundry Agent Service orchestration when you want faster time-to-prod and more managed control points. Option B: Self-hosted inference on AKS Run inference on AKS with your serving stack (e.g., vLLM + PagedAttention), and add MIG slicing where it matches your tenancy model, because KV budget becomes an actual scheduling constraint. Memory layer decisions Long prompts + repeated prefixes: enable prefix caching, and scope it properly per tenant / per model config. OOM or low batch size: treat KV cache as an allocator problem, adopt paging strategies (PagedAttention-style thinking). Tail latency spikes: consider separating prefill and decode where it fits, but accept KV becomes a distributed object with transfer + consistency overhead. Decode feels slow / GPU looks bored: consider speculative decoding, but benchmark it honestly under your workload and acceptance rate. Runtime Observability: Inside the Serving Memory Stack Before we get into metrics, a quick warning: this is where GenAI stops being a model you call and becomes a system you operate. The truth won’t show up in prompt tweaks or averages. It shows up one layer deeper, in queues, schedulers, allocators, and the KV state that decides whether your runtime stays stable under pressure. Remember what I told you above? Latency is percentiles, not averages. So if you can’t see memory behavior, you can’t tune it, and you’ll keep blaming the model for what the serving layer is doing. Most teams instrument the model and forget the runtime. That’s backwards. This whole article is about the fact that performance is often constrained by the serving system (memory, bandwidth, scheduling, batching) before it’s constrained by model quality, and the dominant runtime state is the KV cache. So if you want to run AI like an engineer, you track: TTFT (time to first token) Mostly prefill + queueing/scheduling. This is where “the system feels slow” starts. TPOT / ITL (time per output token / inter-token latency) Mostly decode behavior. This is where memory bandwidth and KV reads show up hardest. KV cache footprint + headroom During decode, KV grows with sequence length and with concurrency. Track how much VRAM is living state vs available runway. KV fragmentation / allocator efficiency Because your max batch size is often limited by allocator reality, not theoretical VRAM. Batch size + effective throughput (system tokens/sec) If throughput dips as contexts get longer, you’re usually watching memory pressure and batching efficiency collapse, not model randomness. Prefix cache hit rate This is where prompt engineering becomes performance engineering. When done correctly, prefix caching skips recomputing shared prefixes. Tail latency under concurrency (p95/p99) Because production is where “mostly fine” still means “incident.” These are the levers that make GenAI stable; everything else is vibes. Determinism Under Load: When the Serving Runtime Changes the Output In well-controlled setups, an LLM can be highly repeatable. But under certain serving conditions, especially high concurrency and dynamic/continuous batching, you may observe something that feels counter-intuitive. Same model. Same request. Same parameters. Different output. First, let me clarify something: I'm not saying that LLMs are unreliable by design. I'm saying something more precise, and more useful.
Reproducibility is a systems property. Why? Because in real serving, the model is only one part of the computation. What actually runs is a serving runtime, batching and scheduling decisions, kernel selection, numeric precision paths, and memory pressure. Under load, those factors can change the effective execution path. And if the runtime isn’t deterministic enough for the guarantees you assume, then “same request” does not always mean “same execution.” This matters because AI is no longer a toy. It’s deployed across enterprise workflows, healthcare, finance, and safety-critical environments. Places where small deviations aren’t “interesting,” they’re risk. In precision-critical fields like healthcare, tiny shifts can matter, not because every use case requires bit-identical outputs, but because safety depends on traceability, validation, and clear operating boundaries. When systematic decisions touch people’s lives, you don’t want “it usually behaves.” You want measurable guarantees, clear operating boundaries, and engineering controls. — Hazem Ali 1. First rule: “Same request” must mean same token stream + same model configuration Before blaming determinism, verify the request is identical at the level that matters: Same tokenizer behavior and token IDs (same text ≠ same tokens across versions/config) Same system prompt/template/tool traces (anything that enters the final serialized prompt) Same weights snapshot + inference configuration (dtype/quantization/positioning settings that affect numerics) If you can’t prove token + config equivalence, don’t blame hardware yet, you may be debugging input drift. Once equivalence is proven, runtime nondeterminism becomes the prime suspect. Prove byte-level equivalence before blaming runtime: same_text_prompt ≠ same_token_ids same_model_name ≠ same_weights_snapshot + quantization/dtype + RoPE/position config same_api_call ≠ same_final_serialized_context (system + tools + history) Common failure modes in the wild: Tokenizer/version changes → different token IDs Quantization/dtype paths → different numerics (often from the earliest layers) RoPE/position config mismatches → representation drift across the sequence Verify (practically): Hash the final serialized prompt bytes Hash the token ID sequence Log/hash the model revision + tokenizer revision + dtype/quantization + RoPE/position settings + decode config across runs 2. Temperature=0 reduces randomness, but it does not guarantee bit-identical execution Greedy decoding { temperature = 0 } is deterministic only if the logits are identical at every step. What greedy actually removes is one source of variability, sampling. It does not guarantee identical results by itself, because the logits are produced by a GPU runtime that may not be strictly deterministic under all serving conditions. Deterministic only if the logits match exactly next_id = logits.argmax() # Deterministic only if logits are bit-identical. # In practice, kernel selection, parallel reductions, atomic operations, # and precision paths can introduce tiny rounding differences # that may flip a borderline argmax. Reality? greedy fixes the decision rule “pick the max”. The serving runtime still controls the forward-pass execution path that produces the logits. If you need strict repeatability, you must align the runtime: deterministic algorithm settings where available, consistent library/toolkit behavior, and stable kernel/math-mode choices across runs. But GPU stacks do not automatically guarantee bit-identical logits across runs. 
**PyTorch** documents that reproducibility can require avoiding nondeterministic algorithms, and it provides ``deterministic`` enforcement that forces deterministic algorithms where available and errors when only nondeterministic implementations exist. So the accurate statement is: [ temp=0 ] makes the decoding rule deterministic, but it doesn’t make the runtime deterministic. 3. Why tiny runtime differences can become big output differences Sometimes a tiny runtime delta stays tiny. Sometimes it cascades. The difference is autoregressive decoding plus sequence length (prompt + generated tokens within the context window). During decode, the model generates one token at a time, and each chosen token is appended back into the context for the next step: So if two runs differ at a single step, because two candidates were near-tied and a tiny numeric delta flipped the choice then the prefixes diverge: From that moment on, the model is conditioning on a different history, so future token distributions can drift. This is not “model mood.” It’s a direct consequence of the autoregressive feedback loop. Where the context window matters is simple and fully mechanical: A longer sequence means more decode steps. More steps means more opportunities for near-ties where a tiny delta can flip a decision. Once a token flips, the rest of the generation can follow a different trajectory because the prefix is now different. So yes: small runtime differences can become big output differences—especially in long generations and long contexts. For example, this snippet demonstrates two facts: Near-tie + tiny delta can flip argmax One flipped choice can cause trajectory divergence in an autoregressive loop. import numpy as np # 1) Near-tie: tiny perturbation can flip argmax z = np.array([0.5012, 0.5008, 0.1, -0.2]) # top-2 are close a = int(np.argmax(z)) b = int(np.argsort(z)[-2]) margin = z[a] - z[b] eps = 3e-4 # tiny perturbation scale print("Top:", a, "Second:", b, "Margin:", margin) # Worst-case-style delta: push top down, runner-up up (illustrative) delta = np.zeros_like(z) delta[a] -= eps delta[b] += eps z2 = z + delta print("Argmax before:", int(np.argmax(z)), "after tiny delta:", int(np.argmax(z2))) # 2) Autoregressive divergence (toy transition model) rng = np.random.default_rng(0) V, T = 8, 30 W = rng.normal(size=(V, V)) # logits for next token given current token def next_token(prev: int, tweak: bool = False) -> int: logits = W[prev].copy() if tweak: top = int(np.argmax(logits)) second = int(np.argsort(logits)[-2]) logits[top] -= 1e-3 logits[second] += 1e-3 return int(np.argmax(logits)) yA = [0] yB = [0] inject_step = 3 for t in range(1, T): yA.append(next_token(yA[-1], tweak=False)) yB.append(next_token(yB[-1], tweak=(t == inject_step))) # single tiny change once first_div = next((i for i, (x, y) in enumerate(zip(yA, yB)) if x != y), None) print("First divergence step:", first_div) print("Run A:", yA) print("Run B:", yB) This toy example isn’t claiming GPU deltas always happen or always flip tokens, only the verified mechanism, near-ties exist, argmax flips are possible if logits differ, and autoregressive decoding amplifies a single early difference into a different continuation. To visualize what’s happening exactly, look at this diagram. On the left, it shows the decode loop as a stateful sequence generator: at step t the model produces logits zt, We pick the next token yt (greedy or sampling), then that token is appended to the prefix and becomes part of the next step’s conditioning. 
That feedback loop is the key: one token is not “just one token”; it becomes future context. On the right, the diagram highlights the failure mode that surprises people in serving: when two candidates are near-tied, a tiny numeric delta (from runtime execution-path differences under load) can flip the choice once. After that flip, the two runs are no longer evaluating the same prefix, so the distributions naturally drift. With a longer context window and longer generations, you simply have more steps where near-ties can occur and more opportunity for a single flip to branch the trajectory. That’s the point to internalize. The runtime doesn’t need to “break” the model to change the output. It only needs to nudge one early decision in a near-tie; autoregressive conditioning does the rest. 4. Under concurrency, serving can change the execution path (and that can change results) Once you go online, the request is not executed alone. It enters a scheduler. Under load, the serving layer is allowed to reshape work to hit latency/throughput goals: Continuous/dynamic batching: requests arrive at different times, get grouped differently, and may be processed with different batch composition or ordering. Chunked or staged execution: some systems split or chunk prefill work to keep the pipeline moving and to avoid blocking decode. Runtime features that change what’s computed and when: prefix caching, speculative decoding, verification passes, paging, and other optimizations can change the shape of the forward-pass workload for “the same” logical request. None of that automatically means outputs must differ. The point is narrower and more important: If batch shape, scheduling, or kernel/math paths can change under pressure, then the effective execution path can change. And repeatability becomes a property of that path, not of your request text. This is exactly why vLLM documents that it does not guarantee reproducibility by default for performance reasons, and points to Batch Invariance when you need outputs to be independent of batch size or request order in online serving. 5. Nondeterminism isn’t folklore. The stack literally tells you it exists If you’ve ever looked at two runs that should match and thought, quite simply, “This doesn’t make sense.” 👈 That reaction is rational. Your engineering brain is detecting a missing assumption. The missing assumption is that inference behaves like a pure function call. In real serving, determinism is not a property of the model alone. It’s a property of the full compute path. Framework level: what the software stack is willing to guarantee At the framework layer, reproducibility is explicitly treated as conditional. PyTorch documents that fully reproducible results are not guaranteed across releases or platforms, and it provides deterministic controls that can force deterministic algorithms where available. The important detail is that when you demand determinism, PyTorch may refuse to run an operation if only nondeterministic implementations exist. That’s not a bug. That’s the framework being honest about the contract you asked for. This matters because it draws a clean boundary: You can make the decision rule deterministic, but you still need the underlying compute path to be deterministic for bit-identical outputs. Now let's dive deeper into the most interesting part here: the GPU level. And yes, I do understand how complex it is, but let me break it down in detail. GPU level: where tiny numeric deltas can come from Now let's go a bit deeper.
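Before the GPU details, here is the core numeric fact in its smallest form: floating-point addition is not associative, so the same values accumulated in a different order can round differently. This is plain CPU arithmetic, not a GPU kernel; it only shows the class of tiny delta that parallel reductions can introduce.

```typescript
// Floating-point addition is not associative: the same values accumulated in a
// different order can produce different results. Two deterministic examples.

// 1) Regrouping changes the rounded result.
const leftToRight = (0.1 + 0.2) + 0.3; // 0.6000000000000001
const regrouped = 0.1 + (0.2 + 0.3);   // 0.6
console.log(leftToRight === regrouped, leftToRight - regrouped); // false, ~1.1e-16

// 2) Order decides whether a small value survives at all.
const forward = [1e16, 1, -1e16].reduce((s, x) => s + x, 0);   // the 1 is absorbed -> 0
const reordered = [1e16, -1e16, 1].reduce((s, x) => s + x, 0); // cancels first    -> 1
console.log(forward, reordered); // 0 1

// GPU reductions combine thousands of partial results whose order can vary
// between runs, so deltas like these are the raw material for a flipped argmax
// in a near-tie.
```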
A lot of GPU deep learning kernels rely on heavy parallelism, and many of the primitives inside them are reductions and accumulations across thousands of threads. Floating-point arithmetic is not strictly order independent, so if the accumulation order changes, you can get tiny rounding differences even with identical inputs. cuDNN treats this as a real engineering topic. Its documentation explicitly discusses determinism and notes that bitwise reproducibility is not guaranteed across different GPU architectures. Most of the time, these deltas are invisible. But decode is autoregressive. If the model hits a near-tie between candidates, a tiny delta can flip one token selection once. After that, the prefixes diverge, and every subsequent step is conditioned on a different history. So the runs naturally drift. That’s mechanics, not “model mood.” Why you notice it more under concurrency Under light traffic, your serving path often looks stable. Under real traffic, it adapts. Batch shape, request interleaving, and scheduling decisions can change across runs. Some stacks explicitly acknowledge this tradeoff. vLLM, for example, documents that it does not guarantee reproducible results by default for performance reasons, and it points to batch-invariance mechanisms when you need outputs that are insensitive to batching and scheduling variation in online serving. The correct interpretation So the right interpretation is not that the model became unreliable. It’s this: You assumed repeatability was a property of the request. In serving, repeatability is a property of the execution path. And under pressure, the execution path is allowed to change. 6. What engineering determinism looks like when you take it seriously Most teams say they want determinism. What they often mean is: “I want it stable enough that nobody notices.” That’s not a guarantee. That’s a hope. If reproducibility matters, treat it like a contract. A real contract has three parts. 1. Name the guarantee you actually need Different guarantees are different problems: Repeatable run-to-run on the same host Repeatable under concurrency (batch/order effects) Repeatable across replicas and rollouts Bitwise repeatable vs “functionally equivalent within tolerance” If you don’t name the target, you can’t validate it. 2. Lock the execution envelope, not just the prompt The envelope is everything that can change the compute path: Final serialized context (system, tools, history, templates) Token IDs Model snapshot / revision Tokenizer revision Precision and quantization path Positioning / RoPE configuration Serving features that reshape work (batching policy, caching, paging, speculative verification) This is exactly why PyTorch calls out that reproducibility is conditional across platforms/releases, and why deterministic enforcement can fail fast when a deterministic implementation doesn’t exist. It’s also why vLLM documents reproducibility as something you must explicitly configure for, and highlights batch invariance for reducing batch/scheduling sensitivity. 3. Make determinism observable, so it stops being a debate This is where teams usually lose time: they only notice drift after users see it. Treat it like any other system property: instrument it. 
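Here is a minimal sketch of what "instrument it" can mean in practice: fingerprint the request, the execution envelope, and the output, and record the serving conditions alongside them, so a divergence becomes a queryable event rather than an anecdote. The field names are illustrative assumptions, not any specific stack's schema.

```typescript
import { createHash } from "node:crypto";

// One structured record per generation. Field names are illustrative
// assumptions; the point is that "same request, different output" becomes a
// measurable event you can join against serving conditions.
interface GenerationRecord {
  requestFingerprint: string;  // hash of final serialized prompt + token IDs + decode config
  envelopeFingerprint: string; // model snapshot, tokenizer rev, dtype, positioning config
  outputFingerprint: string;   // hash of generated token IDs
  batchSize: number;           // serving conditions at execution time
  ttftMs: number;              // time to first token
  tpotMs: number;              // mean time per output token
  kvHeadroomBytes: number;     // free KV budget when the request was admitted
  featuresActive: string[];    // e.g. ["prefix_cache_hit", "speculative_verify"]
}

const sha = (s: string) => createHash("sha256").update(s).digest("hex");

export function fingerprintTokens(tokenIds: number[]): string {
  return sha(tokenIds.join(","));
}

// Divergence is then a simple predicate over two records:
// same request, same envelope, different output.
export function diverged(a: GenerationRecord, b: GenerationRecord): boolean {
  return (
    a.requestFingerprint === b.requestFingerprint &&
    a.envelopeFingerprint === b.envelopeFingerprint &&
    a.outputFingerprint !== b.outputFingerprint
  );
}
```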
Correlate divergence with what you already measure:

- Batch shape and scheduling conditions
- TTFT and TPOT
- KV headroom and memory pressure signals
- p95 and p99 under concurrency
- Which serving features were active (paging, prefix cache hits, speculative verification)

Then something important happens: what "doesn't make sense" becomes a measurable incident class you can reproduce, explain, and control. And this connects directly to Runtime Observability: Inside the Serving Memory Stack. If you already track TTFT/TPOT, KV headroom, batch shape, and p95/p99, you already have the signals needed to explain and control this class of behavior.

Tying memory to trust boundaries

Yes, I know this is a rarely discussed part, but this is where most teams split into two camps. One camp optimizes performance and treats security as someone else's job. The other camp locks everything down and wonders why cost explodes. In reality, memory reuse is both a performance strategy and a security decision. Most people treat performance and security as separate conversations. That is a mistake. Memory reuse, batching, prefix caching, and distributed KV transfer create shared surfaces. Shared surfaces create trust boundary demands. So the real engineering posture is:

- Performance asks you to reuse and share
- Security asks you to isolate and scope
- Production asks you to do both, with observability

That is why I keep repeating the same line across different domains: Production-ready AI is defined by survivability under uncertainty, and memory is where that uncertainty becomes measurable.

Closing: What you should take away

If you remember one thing, make it this: LLM inference can behave like a stateful memory system first, and a model endpoint second. The serving layer (KV cache growth, memory bandwidth during decode, allocator/paging behavior, and batching/scheduling) is what decides whether your system is stable under real traffic, or only impressive in demos.

The hidden thing behind the rarest and most confusing production incidents is not "the model got smarter or dumber." It's when you think you're calling a pure function, but you're actually running a system that may not be strictly deterministic (GPU execution order, atomics, kernel selection) and/or a system that reuses/moves state (KV, prefix cache, paging, continuous batching). In those conditions, same prompt + same params is not always enough to guarantee bit-identical execution.

This is why the references matter: they don't claim magic; they give you mechanisms. PyTorch explicitly documents that some ops are nondeterministic unless you force deterministic algorithms (and may error if no deterministic implementation exists). CUDA thread scheduling/atomics can execute in different orders across runs, and modern serving stacks (e.g., PagedAttention) explicitly treat KV like virtual memory to deal with fragmentation and utilization limits under batching.

What this means, depending on your role

Senior Engineer
Your win is to stop debugging by folklore. When behavior is "weird," ask first: did the effective input change (grounding/tool traces), did the runtime state change (KV length/concurrency), or did the execution path change (batching/kernels)? Then prove it with telemetry.

Principal Engineer
Your job is to make it predictable. Design the serving invariants: cache scoping rules, allocator strategy (paging vs contiguous), admission control, and a determinism stance (what you guarantee, what you don't, and how you detect drift). PyTorch literally gives you switches for deterministic enforcement; use them deliberately, knowing the tradeoffs.
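As a concrete sketch of those switches (assuming a recent PyTorch build; this is my illustration, not code from the article), deterministic enforcement looks roughly like this:

```python
import os
import torch

# Must be set before any CUDA work so certain cuBLAS ops have a
# deterministic implementation available.
os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")

torch.manual_seed(1234)                    # pin the decision rule's randomness
torch.use_deterministic_algorithms(True)   # raise if only nondeterministic kernels exist
                                           # (warn_only=True logs instead of raising)
torch.backends.cudnn.benchmark = False     # autotuning can pick different kernels per run

x = torch.randn(64, 128)
w = torch.randn(128, 32)
y = x @ w
# Repeatable on the same host, build, and device; not a cross-platform,
# cross-release guarantee, which is exactly the boundary PyTorch documents.
```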
SRE
Treat inference like an OS workload: queues, memory headroom, allocator efficiency, and p95/p99 under concurrency. If you can't see TTFT/TPOT + KV headroom + batching behavior, you're not observing the system you're operating.

CTO / Platform Owner
The win isn't buying bigger GPUs. It's building control points: governance boundaries, isolation/scoping for shared state, determinism expectations, and operational discipline that makes rare failures survivable.

My recommendation
> Be explicit about what you optimize and what you guarantee.
> If you need strict reproducibility, enforce deterministic modes where possible and accept performance tradeoffs.
> If you need scale, treat KV as a first-class resource: paging/fragmentation and scheduling will bound throughput long before "model quality" does.
> And for both: measure under concurrency, because that's where systems stop sounding like opinions and start behaving like physics.

Acknowledgments
While this article dives into the hidden memory mechanics that shape LLM behavior under load, I'm grateful it was peer-reviewed and challenged before publishing. A special thank you to Hammad Atta for peer-reviewing this piece and challenging it from a security-and-systems angle. A special thank you to Luis Beltran for peer-reviewing this piece and challenging it from an AI engineering and deployment angle. A special thank you to André Melancia for peer-reviewing this piece and challenging it from an operational rigor angle.

If this article resonated, it's probably because I genuinely enjoy the hard parts: the layers most teams avoid because they're messy, subtle, and unforgiving. If you're dealing with real AI serving complexity in production, feel free to connect with me on LinkedIn. I'm always open to serious technical conversations and knowledge sharing with engineers building scalable production-grade systems.

Thanks for reading. Hope this article helps you spot the hidden variables in serving and turn them into repeatable, testable controls. And I'd love to hear what you're seeing in your own deployments.

— Hazem Ali
Microsoft AI MVP, Distinguished AI and ML Engineer / Architect

What's New in Microsoft EDU, Bett Edition January 2026
Welcome to our update for Microsoft Education and our special Bett 2026 edition! The Bett conference takes place in London during the week of January 21st - January 23rd, and Microsoft Education has 18 exciting updates to share! Check out the official Bett News blog here, and for our full Bett schedule and session times, be sure to check out our Microsoft EDU Bett 2026 guide. January 2026 topics: Microsoft 365 Updates for Educators Microsoft Learning Zone Microsoft 365 Updates for Students Teams EDU and OneNote EDU Updates Microsoft 365 LTI Updates Minecraft EDU 1. New Educator tools coming to the Teach Module in Microsoft 365 Unit Plans Soon educators will be able to create unit plans in Teach. Using a familiar interface, educators will be able to describe their unit, ground in existing content and educational standards, and attach any existing lesson plans. Unit plans will be created as Microsoft Word documents to facilitate easy edits and sharing. When: Preview in Spring 2026 Minecraft Lesson Plans Minecraft Education prepares students for the future workplace by helping build skills like collaboration, creative problem-solving, communication, and computational thinking. Coming soon, you will be able to create lesson plans in Teach that are fully teachable in Minecraft Education. And if you’re new to Minecraft Education, the lesson plan includes step-by-step instructions to get started. Just like the existing lesson plan tool in Teach, Minecraft Lessons can be grounded on your class details, existing content, and educational standards from 35+ countries. When: Preview in February 2026 Modify Content When: In Preview now Teach supports educators in modifying their existing teaching materials using AI-powered tools that save time and help meet the diverse needs of learners. With Modify existing content, educators can quickly adapt lessons they already use—without starting from scratch—by aligning materials to standards, differentiating instructions, adjusting reading levels, and enhancing text with supporting examples. Each modification tool accepts direct text input or file uploads from cloud storage, making it easy to transform current curriculum resources. These tools help educators maintain instructional intent while ensuring content is accessible, standards aligned, and effective for all learners. Align materials to standards Aligning instructional content to educational standards helps ensure lessons clearly support required learning goals and set the right expectations for learners. The Align to Standards tool rewrites existing lesson instructions so they reflect the intent of the selected standard—focusing on what learners should understand or be able to do—without copying the standard’s wording. Scenario: An educator has a lesson instruction for a reading activity on ecosystems. After selecting a state science standard, the educator uses Align to Standards to produce a revised instruction that emphasizes system interactions and evidence-based explanations while preserving the lesson’s original purpose. This allows the educator to strengthen alignment quickly without rewriting the lesson from scratch. Differentiate instructions Differentiation helps ensure every learner—regardless of readiness, background knowledge, or support needs—can access and engage with instructional tasks. 
The Differentiate Instructions tool adapts existing instructions based on specific supports an educator selects, such as adjusting reading level, including a single type of scaffold, or targeting a desired length. Because this tool is designed for single shot use, it produces a clear, accurate adaptation that adheres directly to the selected inputs. Scenario: A secondary biology educator has lab instructions written for general education learners but needs versions for learners requiring additional scaffolding. Using Differentiate Instructions, the educator quickly generates modified instructions that include step-by-step breakdowns, sentence starters, or graphic organizers—making the lab more accessible without changing the learning goal. Modify reading level Adjusting the reading level helps ensure instructional content remains accessible while preserving essential vocabulary and core concepts. The Modify reading level tool rewrites text to match a specified grade level, simplifying or increasing complexity as needed while maintaining meaning. Educators can also choose to generate a glossary with clear, age-appropriate definitions of key terms. Scenario: A social studies educator wants students to work with a primary source written at a university reading level. Using Modify reading level, the educator creates a version that maintains the document’s key ideas and important historical terms while simplifying sentence structure for lower secondary learners. By adding a glossary, students can access learner friendly definitions alongside the adapted text. Add supporting examples Concrete examples strengthen understanding by connecting abstract ideas to real world applications. The Add Supporting Examples tool enhances existing instructional content by appending relevant, accurate, and age-appropriate examples—without altering the original paragraph. Scenario: An educator teaching thermal energy transfer has a paragraph explaining that heat moves from warmer objects to cooler ones, but the concept feels abstract. Using Add Supporting Examples, the educator adds real world examples—such as a metal spoon warming in hot soup or an ice cube melting on a countertop—to help learners visualize how heat transfer works. These examples reinforce understanding and make the concept more accessible for secondary learners. Fill in the Blanks, Matching and Quizzing New Learning Activities are coming soon! We’re excited to introduce three new Learning Activities designed to make classroom experiences more dynamic and personalized: Fill in the Blanks, Matching, and Quizzes. Whether it’s completing paragraphs to strengthen comprehension, pairing terms with definitions in a timed matching game, or testing knowledge through quick self-assessments, these activities bring variety and fun to learning. Fill in the blanks creates paragraphs where learners can check their understanding by filling in missing terms. Matching is a game where learners can match terms and definitions while racing against the clock, aiming for fast completion and accuracy. And Quizzes allows students to quiz themselves and assess their comprehension. Learning Activities are available across our education products, in a standalone web app, in the Teach Module, in Teams for Education, in the Study and Learn agent and Study Guides. When: Spring 2026 Teach Module updates in Teams Classwork In Teams Classwork, you can already use Copilot to create Lesson Plans, Flashcards, and Fill in the Blank Activities. 
Coming this Spring, you will see the ability to create and modify more content, better matching the capabilities of Teach in the Microsoft 365 Copilot App. This includes modifying content with AI, Minecraft Lessons, and more! When: Coming soon Teach Module and Class Notebook integration We're bringing Copilot-powered creation tools directly into OneNote Class Notebook. Teachers will be able to generate Learning Activities and quizzes or modify existing content (like adjusting reading level or adding supporting examples) without leaving the page where they're already planning. When: Coming soon 2. Spark classroom engagement with Microsoft Learning Zone Educators worldwide are always looking for innovative ways to engage students, personalize learning, and support individual growth, yet limited time and resources often stand in their way. Microsoft Learning Zone, a new Windows app, empowers educators to transform any idea or resource into an interactive, personalized lesson using AI on Copilot+ PCs. The app also provides actionable insights to guide instruction and support every student’s progress. Learning Zone is now available to download from the Windows app store and included at no additional cost with all Microsoft Education licenses. Just in time for Bett 2026, Learning Zone has earned the prestigious ISTE Seal of Alignment - a recognized mark of quality, accessibility, and inclusive design. This recognition reflects our commitment to delivering meaningful, inclusive, and research-backed digital learning experiences for every learner. As noted by ISTE reviewers: "Microsoft Learning Zone saves educators valuable time while delivering personalized instruction that addresses individual learning needs." Getting started with Microsoft Learning Zone is simple. Educators begin by defining their lesson goals and preferences and can also choose to reference their teaching materials or trusted in-app resources by OpenStax. From there, AI does the heavy-lifting, generating a complete, interactive lesson with engaging content slides and a variety of practice activities. Educators can also quickly create Kahoot! quizzes using AI, bringing live classroom gamification into their lessons with just a few clicks. Learning Zone is more than content creation; it provides a full classroom-ready solution: from assignment to actionable insights. Once a lesson is created and reviewed, educators can assign it to students. Students complete lessons at their own pace, on any device, while the lesson flow adapts to their responses, helping reinforce understanding, revisit missed concepts, and build confidence over time. Educators, in turn, gain clear, actionable insights into student progress and mastery, enabling them to personalize instruction and better support every learner’s growth. Learning Zone is a classroom ready solution including management and actionable insights Learning Zone also includes an extensive library of ready-to-learn lessons developed in collaboration with leading global organizations, including the Nobel Peace Center, PBS NewsHour, the World Wildlife Fund (WWF), NASA, OpenStax, Figma, and Minecraft Education. Ready-to-learn lessons are available to educators and students on any Windows device and are a great way to inspire curiosity and bring meaningful learning of different subjects into the classroom. Ready-to-learn library in partnership with trusted global organizations Learning Zone is available today: Visit https://learningzone.microsoft.com to learn more and download the app. 3. 
New AI-powered tools for student learning in Microsoft 365

Study and Learn Agent
Bring the interactive, conversational Study and Learn Agent in the Microsoft 365 Copilot App to your students. Available to all Microsoft EDU customers, the agent does not require an additional Copilot license. It is going into preview now, in January 2026. Join the Microsoft Education Insiders community at https://aka.ms/joinEIP and get information about getting access to the Preview. Study and Learn helps learners understand concepts, practice skills with activities like flashcards, and prepare for tests with study guides and quizzes. Additional activities, including fill-in-the-blanks and matching, will continue to be added. Purpose-built for learning in collaboration with learning science experts, Study and Learn aims to help foster reflective and critical thinking. Over time, it will provide a more personalized, adaptive, inclusive experience to make learning relevant and bolster motivation.
When: January 2026 Preview

Learning Activities app
The Learning Activities Web App is now here! This web-based experience brings all your favorite activities together in one place, making it easier than ever to create, customize, and share engaging content. Whether you're an educator designing lessons or a student building study sets, the web app offers a streamlined interface for finding or creating Flashcards and Fill in the Blanks, with Matching and Quizzes coming soon. You can easily access all the activities that you have created in other products from the web app, too.
When: Available now!

4. Updates for your favorite teaching tools - Teams EDU and OneNote EDU

Set AI Guidelines in Teams
To help bring clarity to AI use in the classroom, AI Guidelines in Assignments allow educators to set clear expectations for when and how students can use AI—directly within the assignment experience. Educators start with a set of default, standardized AI use levels, and can apply them at the class or assignment level, with the ability to customize descriptions to reflect their school or district guidelines. These guidelines are clearly visible to students, reducing confusion and supporting responsible, transparent AI use, while also encouraging learners to use secure, education-ready Copilot.
When: In Preview Q1

Add Learning Activities to Teams Assignments
Learning Activities are coming to Teams Assignments and supported LMS platforms in preview, helping educators integrate interactive practice into the assignment workflows they already use. Educators can add activities such as Flashcards, Fill in the Blanks, and Matching, and share resource documents that enable students to create their own learning activities within an assignment or the Classwork module. Students complete activities seamlessly within Assignments or their LMS, with progress captured as part of the assignment experience—supporting active, student-driven learning while keeping setup, instruction and review in one familiar place. Students can create their own learning activities from educator-shared resources within an assignment or Classwork.
When: In Preview Q1

New information literacy features in Search Progress in Teams Assignments
Now students don't just gather sources—they investigate them. Four new research prompts (Source Reputation, Factual Importance, Cross-check, Source Purpose) make their thinking visible as they research.
Read more about these new features in the preview blog here, and stay tuned for Microsoft Learn course updates to come. When: Available now Add Learning Zone lessons to Teams Assignments and LMS Learning Zone lessons are coming to Teams Assignments and Microsoft 365 LTI for LMS platforms in preview, allowing educators to bring interactive lessons directly into the assignments and grading workflows they already use. Educators can attach Learning Zone lessons during assignment creation, while students complete them fully embedded within Assignments or their LMS, with progress and scores automatically synchronized for review. This preview helps educators save time, reduce manual setup and grading steps, and confidently deliver interactive learning experiences—while keeping assignment creation, student work, and review all in one place. When: Preview in February Embed Learning Activities in OneNote You asked, we're building it. Soon, learners and educators alike will be able to copy a Learning Activity link, paste it into any OneNote classic page, and have it render inline – all to help folks engage without leaving the page. When: NH Spring 2026 5. Create with Copilot in your LMS In addition to supporting the new Learning Zone lessons in assignments, we are adding exciting new Create with Copilot options in Microsoft 365 LTI which bring the AI-powered capabilities of the Teach Module directly into LMS content creation workflows. From within their course, educators can use Copilot to draft lesson materials and other instructional content which is seamlessly published to the course using familiar Microsoft 365 tools. Create with Copilot is also available in LMS content editors to help educators compose content, discussion posts, and more. This includes the ability to modify existing content, if supported by the LMS platform. By embedding the creation experience where courses are designed and managed, Microsoft 365 LTI helps educators preserve instructional intent, reduce context switching, and move more quickly from planning to teaching. Microsoft 365 LTI is available to any Microsoft Education customer without additional licensing. LMS administrators can deploy the integration to an LTI 1.3 compatible LMS like Canvas, Blackboard, PowerSchool Schoology Learning, D2L/Brightspace and Moodle to get started! When: Preview in February 6. Dedicated servers coming to Minecraft Education Minecraft Education is launching a new feature that enables IT administrators and educators to run dedicated servers to host persistent worlds for use in classrooms and after-school programs, similar to Minecraft Bedrock’s dedicated servers (for the consumer version of the game). Dedicated servers enable cross-tenant gameplay, which is a gamechanger for expanding multiplayer experiences in the classroom or running Minecraft esports programs with other schools. This feature is currently in Beta to release in February for general availability for all Minecraft Education users. (Minecraft Education is available in Microsoft A3 and A5 software subscriptions for schools.) ___________________________________________________________________________________ And finally, just to recap all the news we have for you this month, here’s a quick review of all the features that are generally available or are rolling out soon: Teach Module Microsoft 365 Updates for Educators • Unit Plans – available in spring • Minecraft Lesson plans – preview in February • Modify content – align to standards. 
Private preview now
• Modify content – modify reading level. Private preview now
• Modify content – add supporting examples. Private preview now
• Modify content – differentiate instructions. Private preview now
• Teach Module integration into OneNote Class Notebooks – preview in spring

Microsoft Learning Zone
• Available to download from the Windows store, at no additional cost
• Provides a full classroom-ready solution including lesson management and insights
• Teach Module, Teams Assignments and LMS integration in March

Microsoft 365 Updates for Students
• Study and Learn Agent – preview in late January
• Learning Activities – Fill in the Blanks generally available
• Learning Activities – Matching Activities in private preview now
• Learning Activities – Self-quizzing available in private preview in February

Teams and OneNote EDU Updates
• Set expected AI use in Assignments – private preview end of January
• Add Flashcards to Assignments – private preview in February
• New information literacy features in Search Progress
• Embed Learning Activities in OneNote – private preview in spring

Copilot in your Learning Management System
Dedicated Minecraft EDU servers

Have any feedback to share with us? As always, we'd love to hear it!

Mike Tholfsen
Group Product Manager
Microsoft Education
Advanced Function Calling and Multi-Agent Systems with Small Language Models in Foundry Local

In our previous exploration of function calling with Small Language Models, we demonstrated how to enable local SLMs to interact with external tools using a text-parsing approach with regex patterns. While that method worked, it required manual extraction of function calls from the model's output; functional but fragile. Today, I'm excited to show you something far more powerful: Foundry Local now supports native OpenAI-compatible function calling with select models. This update transforms how we build agentic AI systems locally, making it remarkably straightforward to create sophisticated multi-agent architectures that rival cloud-based solutions. What once required careful prompt engineering and brittle parsing now works seamlessly through standardized API calls.

We'll build a complete multi-agent quiz application that demonstrates both the elegance of modern function calling and the power of coordinated agent systems. The full source code is available in this GitHub repository, but rather than walking through every line of code, we'll focus on how the pieces work together and what you'll see when you run it.

What's New: Native Function Calling in Foundry Local

As we explored in our guide to running Phi-4 locally with Foundry Local, we ran powerful language models on our local machine. The latest version now supports native function calling for models specifically trained with this capability. The key difference is architectural. In our weather assistant example, we manually parsed JSON strings from the model's text output using regex patterns, and, frankly speaking, meticulously tested and tweaked the system prompt for the umpteenth time 🙄. Now, when you provide tool definitions to supported models, they return structured tool_calls objects that you can directly execute.

Currently, this native function calling capability is available for the Qwen 2.5 family of models in Foundry Local. For this tutorial, we're using the 7B variant, which strikes a great balance between capability and resource requirements.

Quick Setup

Getting started requires just a few steps. First, ensure you have Foundry Local installed. On Windows, use winget install Microsoft.FoundryLocal, and on macOS, use brew install microsoft/foundrylocal/foundrylocal. You'll need version 0.8.117 or later. Install the Python dependencies in the requirements file, then start your model. The first run will download approximately 4GB:

foundry model run qwen2.5-7b-instruct-cuda-gpu

If you don't have a compatible GPU, use the CPU version instead, or you can specify any other Qwen 2.5 variant that suits your hardware. I have set a DEFAULT_MODEL_ALIAS variable in the utils/foundry_client.py file that you can modify to use a different model. Keep this terminal window open. The model needs to stay running while you develop and test your application.
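For orientation, here is a minimal sketch of the kind of helper a file like utils/foundry_client.py typically contains, using the foundry-local-sdk together with the OpenAI client. The repository's actual implementation and names may differ, so treat this as an assumption-labeled illustration rather than the project's exact code.

```python
# Hypothetical sketch of a Foundry Local connection helper; the real
# utils/foundry_client.py in the repo may be structured differently.
from foundry_local import FoundryLocalManager
from openai import OpenAI

DEFAULT_MODEL_ALIAS = "qwen2.5-7b-instruct-cuda-gpu"  # match the variant you started

def get_client_and_model(alias: str = DEFAULT_MODEL_ALIAS):
    # Attaches to the local Foundry service (starting it if needed) and
    # resolves the alias to the concrete model id loaded on this machine.
    manager = FoundryLocalManager(alias)
    client = OpenAI(base_url=manager.endpoint, api_key=manager.api_key)
    return client, manager.get_model_info(alias).id

if __name__ == "__main__":
    client, model_id = get_client_and_model()
    reply = client.chat.completions.create(
        model=model_id,
        messages=[{"role": "user", "content": "Say hello in one short sentence."}],
    )
    print(reply.choices[0].message.content)
```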
Understanding the Architecture

Before we dive into running the application, let's understand what we're building. Our quiz system follows a multi-agent architecture where specialized agents handle distinct responsibilities, coordinated by a central orchestrator. The flow works like this: when you ask the system to generate a quiz about photosynthesis, the orchestrator agent receives your message, understands your intent, and decides which tool to invoke. It doesn't try to generate the quiz itself; instead, it calls a tool that creates a specialist QuizGeneratorAgent focused solely on producing well-structured quiz questions. Then there's another agent, ReviewAgent, that reviews the quiz with you.

The project structure reflects this architecture:

quiz_app/
├── agents/         # Base agent + specialist agents
├── tools/          # Tool functions the orchestrator can call
├── utils/          # Foundry client connection
├── data/
│   ├── quizzes/    # Generated quiz JSON files
│   └── responses/  # User response JSON files
└── main.py         # Application entry point

The orchestrator coordinates three main tools: generate_new_quiz, launch_quiz_interface, and review_quiz_interface. Each tool either creates a specialist agent or launches an interactive interface (Gradio), handling the complexity so the orchestrator can focus on routing and coordination.

How Native Function Calling Works

When you initialize the orchestrator agent in main.py, you provide two things: tool schemas that describe your functions to the model, and a mapping of function names to actual Python functions. The schemas follow the OpenAI function calling specification, describing each tool's purpose, parameters, and when it should be used.

Here's what happens when you send a message to the orchestrator: The agent calls the model with your message and the tool schemas. If the model determines a tool is needed, it returns a structured tool_calls attribute containing the function name and arguments as a proper object—not as text to be parsed. Your code executes the tool, creates a message with "role": "tool" containing the result, and sends everything back to the model. The model can then either call another tool or provide its final response.

The critical insight is that the model itself controls this flow through a while loop in the base agent. Each iteration represents the model examining the current state, deciding whether it needs more information, and either proceeding with another tool call or providing its final answer. You're not manually orchestrating when tools get called; the model makes those decisions based on the conversation context. A minimal sketch of this schema-plus-loop pattern is shown below.
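To make that concrete, here is a compact sketch of the schema-plus-loop pattern, using the standard OpenAI-compatible tool-calling shape described above. The tool name mirrors the quiz app, but the repository's base agent may structure this differently; read it as an illustration, not the project's code.

```python
import json

# One example tool schema (OpenAI function-calling format) plus the dispatch loop.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "generate_new_quiz",
        "description": "Create a quiz on a topic and save it as a JSON file.",
        "parameters": {
            "type": "object",
            "properties": {
                "topic": {"type": "string"},
                "num_questions": {"type": "integer", "minimum": 1, "maximum": 20},
            },
            "required": ["topic", "num_questions"],
        },
    },
}]

def run_agent(client, model_id, messages, tools, tool_functions, max_turns=8):
    """Keep calling the model until it answers without requesting another tool."""
    for _ in range(max_turns):
        response = client.chat.completions.create(
            model=model_id, messages=messages, tools=tools
        )
        msg = response.choices[0].message
        if not msg.tool_calls:                 # final answer, no tool needed
            return msg.content
        messages.append(msg)                   # keep the tool request in the history
        for call in msg.tool_calls:            # execute each structured tool call
            args = json.loads(call.function.arguments)
            result = tool_functions[call.function.name](**args)
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": json.dumps(result),
            })
    return "Stopped: too many tool turns."
```

Nothing here parses free text; the control flow hangs entirely off the structured tool_calls field the model returns.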
Seeing It In Action

Let's walk through a complete session to see how these pieces work together. When you run python main.py, you'll see the application connect to Foundry Local and display a welcome banner. Now type a request like "Generate a 5 question quiz about photosynthesis." Watch what happens in your console: The orchestrator recognized your intent, selected the generate_new_quiz tool, and extracted the topic and number of questions from your natural language request. Behind the scenes, this tool instantiated a QuizGeneratorAgent with a focused system prompt designed specifically for creating quiz JSON. The agent used a low temperature setting to ensure consistent formatting and generated questions that were saved to the data/quizzes folder.

This demonstrates the first layer of the multi-agent architecture: the orchestrator doesn't generate quizzes itself. It recognizes that this task requires specialized knowledge about quiz structure and delegates to an agent built specifically for that purpose.

Now request to take the quiz by typing "Take the quiz." The orchestrator calls a different tool and a Gradio server is launched. Click the link to open a browser window displaying your quiz questions. This tool demonstrates how function calling can trigger complex interactions—it reads the quiz JSON, dynamically builds a user interface with radio buttons for each question, and handles the submission flow. After you answer the questions and click submit, the interface saves your responses to the data/responses folder and closes the Gradio server. The orchestrator reports completion.

The system now has two JSON files: one containing the quiz questions with correct answers, and another containing your responses. This separation of concerns is important—the quiz generation phase doesn't need to know about response collection, and the response collection doesn't need to know how quizzes are created. Each component has a single, well-defined responsibility.

Now request a review. The orchestrator calls the third tool: A new chat interface opens, and here's where the multi-agent architecture really shines. The ReviewAgent is instantiated with full context about both the quiz questions and your answers. Its system prompt includes a formatted view of each question, the correct answer, your answer, and whether you got it right. This means when the interface opens, you immediately see personalized feedback.

The Multi-Agent Pattern

Multi-agent architectures solve complex problems by coordinating specialized agents rather than building monolithic systems. This pattern is particularly powerful for local SLMs. A coordinator agent routes tasks to specialists, each optimized for narrow domains with focused system prompts and specific temperature settings. You can use a 1.7B model for structured data generation, a 7B model for conversations, and a 4B model for reasoning, all orchestrated by a lightweight coordinator. This is more efficient than requiring one massive model for everything.

Foundry Local's native function calling makes this straightforward. The coordinator reliably invokes tools that instantiate specialists, with structured responses flowing back through proper tool messages. The model manages the coordination loop—deciding when it needs another specialist, when it has enough information, and when to provide a final answer.

In our quiz application, the orchestrator routes user requests but never tries to be an expert in quiz generation, interface design, or tutoring. The QuizGeneratorAgent focuses solely on creating well-structured quiz JSON using constrained prompts and low temperature. The ReviewAgent handles open-ended educational dialogue with embedded quiz context and higher temperature for natural conversation. The tools abstract away file management, interface launching, and agent instantiation; the orchestrator just knows "this tool launches quizzes" without needing implementation details.

This pattern scales effortlessly. If you wanted to add a new capability like study guides or flashcards, you could easily create a new tool or specialist. The orchestrator gains these capabilities automatically from the tool schemas you define, without modifying core logic. This same pattern powers production systems with dozens of specialists handling retrieval, reasoning, execution, and monitoring, each excelling in its domain while the coordinator ensures seamless collaboration.

Why This Matters

The transition from text-parsing to native function calling enables a fundamentally different approach to building AI applications. With text parsing, you're constantly fighting against the unpredictability of natural language output.
A model might decide to explain why it's calling a function before outputting the JSON, or it might format the JSON slightly differently than your regex expects, or it might wrap it in markdown code fences. Native function calling eliminates this entire class of problems. The model is trained to output tool calls as structured data, separate from its conversational responses.

The multi-agent aspect builds on this foundation. Because function calling is reliable, you can confidently delegate to specialist agents knowing they'll integrate smoothly with the orchestrator. You can chain tool calls—the orchestrator might generate a quiz, then immediately launch the interface to take it, based on a single user request like "Create and give me a quiz about machine learning." The model handles this orchestration intelligently because the tool results flow back as structured data it can reason about.

Running everything locally through Foundry Local adds another dimension of value, and I am genuinely excited about this (hopefully, the Phi models get this functionality soon). You can experiment freely, iterate quickly, and deploy solutions that run entirely on your infrastructure. For educational applications like our quiz system, this means students can interact with the AI tutor as much as they need without cost concerns.

Getting Started With Your Own Multi-Agent System

The complete code for this quiz application is available in the GitHub repository, and I encourage you to clone it and experiment. Try modifying the tool schemas to see how the orchestrator's behavior changes. Add a new specialist agent for a different task. Adjust the system prompts to see how agent personalities and capabilities shift.

Think about the problems you're trying to solve. Could they benefit from having different specialists handling different aspects? A customer service system might have agents for order lookup, refund processing, and product recommendations. A research assistant might have agents for web search, document summarization, and citation formatting. A coding assistant might have agents for code generation, testing, and documentation.

Start small, perhaps with two or three specialist agents for a specific domain. Watch how the orchestrator learns to route between them based on the tool descriptions you provide. You'll quickly see opportunities to add more specialists, refine the existing ones, and build increasingly sophisticated systems that leverage the unique strengths of each agent while presenting a unified, intelligent interface to your users.

In the next entry, we will be deploying our quiz app, which will mark the end of our journey in Foundry and SLMs these past few weeks. I hope you are as excited as I am! Thanks for reading.

AI Didn't Break Your Production — Your Architecture Did
Most AI systems don’t fail in the lab. They fail the moment production touches them. I’m Hazem Ali — Microsoft AI MVP, Principal AI & ML Engineer / Architect, and Founder & CEO of Skytells. With a strong foundation in AI and deep learning from low-level fundamentals to production-scale, backed by rigorous cybersecurity and software engineering expertise, I design and deliver enterprise AI systems end-to-end. I often speak about what happens after the pilot goes live: real users arrive, data drifts, security constraints tighten, and incidents force your architecture to prove it can survive. My focus is building production AI with a security-first mindset: identity boundaries, enforceable governance, incident-ready operations, and reliability at scale. My mission is simple: Architect and engineer secure AI systems that operate safely, predictably, and at scale in production. And here’s the hard truth: AI initiatives rarely fail because the model is weak. They fail because the surrounding architecture was never engineered for production reality. - Hazem Ali You see this clearly when teams bolt AI onto an existing platform. In Azure-based environments, the foundation can be solid—identity, networking, governance, logging, policy enforcement, and scale primitives. But that doesn’t make the AI layer production-grade by default. It becomes production-grade only when the AI runtime is engineered like a first-class subsystem with explicit boundaries, control points, and designed failure behavior. A quick moment from the field I still remember one rollout that looked perfect on paper. Latency was fine. Error rate was low. Dashboards were green. Everyone was relaxed. Then a single workflow started creating the wrong tickets, not failing or crashing. It was confidently doing the wrong thing at scale. It took hours before anyone noticed, because nothing was broken in the traditional sense. When we finally traced it, the model was not the root cause. The system had no real gates, no replayable trail, and tool execution was too permissive. The architecture made it easy for a small mistake to become a widespread mess. That is the gap I’m talking about in this article. Production Failure Taxonomy This is the part most teams skip because it is not exciting, and it is not easy to measure in a demo. When AI fails in production, the postmortem rarely says the model was bad. It almost always points to missing boundaries, over-privileged execution, or decisions nobody can trace. So if your AI can take actions, you are no longer shipping a chat feature. You are operating a runtime that can change state across real systems, that means reliability is not just uptime. It is the ability to limit blast radius, reproduce decisions, and stop or degrade safely when uncertainty or risk spikes. You can usually tell early whether an AI initiative will survive production. Not because the model is weak, but because the failure mode is already baked into the architecture. Here are the ones I see most often. 1. Healthy systems that are confidently wrong Uptime looks perfect. Latency is fine. And the output is wrong. This is dangerous because nothing alerts until real damage shows up. 2. The agent ends up with more authority than the user The user asks a question. The agent has tools and credentials. Now it can do things the user never should have been able to do in that moment. 3. Each action is allowed, but the chain is not Read data, create ticket, send message. All approved individually. 
Put together, it becomes a capability nobody reviewed. 4. Retrieval becomes the attack path Most teams worry about prompt injection. Fair. But a poisoned or stale retrieval layer can be worse, because it feeds the model the wrong truth. 5. Tool calls turn mistakes into incidents The moment AI can change state—config, permissions, emails, payments, or data—a mistake is no longer a bad answer. It is an incident. 6. Retries duplicate side effects Timeouts happen. Retries happen. If your tool calls are not safe to repeat, you will create duplicate tickets, refunds, emails, or deletes. Next, let’s talk about what changes when you inject probabilistic behavior into a deterministic platform. In the Field: Building and Sharing Real-World AI In December 2025, I had the chance to speak and engage with builders across multiple AI and technology events, sharing what I consider the most valuable part of the journey: the engineering details that show up when AI meets production reality. This photo captures one of those moments: real conversations with engineers, architects, and decision-makers about what it truly takes to ship production-grade AI. During my session, Designing Scalable and Secure Architecture at the Enterprise Scale I walked through the ideas in this article live on stage then went deeper into the engineering reality behind them: from zero-trust boundaries and runtime policy enforcement to observability, traceability, and safe failure design, The goal wasn’t to talk about “AI capability,” but to show how to build AI systems that operate safely and predictably at scale in production. Deterministic platforms, probabilistic behavior Most production platforms are built for deterministic behavior: defined contracts, predictable services, stable outputs. AI changes the physics. You introduce probabilistic behavior into deterministic pipelines and your failure modes multiply. An AI system can be confidently wrong while still looking “healthy” through basic uptime dashboards. That’s why reliability in production AI is rarely about “better prompts” or “higher model accuracy.” It’s about engineering the right control points: identity boundaries, governance enforcement, behavioral observability, and safe degradation. In other words: the model is only one component. The system is the product. Production AI Control Plane Here’s the thing. Once you inject probabilistic behavior into a deterministic platform, you need more than prompts and endpoints. You need a control plane. Not a fancy framework. Just a clear place in the runtime where decisions get bounded, actions get authorized, and behavior becomes explainable when something goes wrong. This is the simplest shape I have seen work in real enterprise systems. The control plane components Orchestrator Owns the workflow. Decides what happens next, and when the system should stop. Retrieval Brings in context, but only from sources you trust and can explain later. Prompt assembly Builds the final input to the model, including constraints, policy signals, and tool schemas. Model call Generates the plan or the response. It should never be trusted to execute directly. Policy Enforcement Point The gate before any high impact step. It answers: is this allowed, under these conditions, with these constraints. Tool Gateway The firewall for actions. Scopes every operation, validates inputs, rate-limits, and blocks unsafe calls. Audit log and trace store A replayable chain for every request. If you cannot replay it, you cannot debug it. 
Risk engine
Detects prompt injection signals, anomalous sessions, and uncertainty spikes, and switches the runtime into safer modes.

Approval flow
For the few actions that should never be automatic. It is the line between assistance and damage.

If you take one idea from this section, let it be this. The model is not where you enforce safety. Safety lives in the control plane.

Next, let's talk about the most common mistake teams make right after they build the happy-path pipeline. Treating AI like a feature.

The common architectural trap: treating AI like a feature

Many teams ship AI like a feature: prompt → model → response. That structure demos well. In production, it collapses the moment AI output influences anything stateful: tickets, approvals, customer messaging, remediation actions, or security decisions. At that point, you're not "adding AI." You're operating a semi-autonomous runtime. The engineering questions become non-negotiable:

- Can we explain why the system responded this way?
- Can we bound what it's allowed to do?
- Can we contain impact when it's wrong?
- Can we recover without human panic?

If those answers aren't designed into the architecture, production becomes a roulette wheel.

Governance is not a document. It's a runtime enforcement capability

Most governance programs fail because they're implemented as late-stage checklists. In production, governance must live inside the execution path as an enforceable mechanism: a Policy Enforcement Point (PEP) that evaluates every high-impact step before it happens. At the moment of execution, your runtime must answer a strict chain of authorization questions:

1. What tools is this agent attempting to call?
Every tool invocation is a privilege boundary. Your runtime must identify the tool, the operation, and the intended side effect (read vs write, safe vs state-changing).

2. Does the tool have the right permissions to run for this agent?
Even before user context, the tool itself must be runnable by the agent's workload identity (service principal / managed identity / workload credentials). If the agent identity can't execute the tool, the call is denied, period.

3. If the tool can run, is the agent permitted to use it for this user?
This is the missing piece in most systems: delegation. The agent might be able to run the tool in general, but not on behalf of this user, in this tenant, in this environment, for this task category. This is where you enforce:
- user role / entitlement
- tenant boundaries
- environment (prod vs staging)
- session risk level (normal vs suspicious)

4. If yes, which tasks/operations are permitted?
Tools are too broad. Permissions must be operation-scoped. Not "Jira tool allowed." But "Jira: create ticket only, no delete, no project-admin actions." Not "Database tool allowed." But "DB: read-only, specific schema, specific columns, row-level filters." This is ABAC/RBAC + capability-based execution.

5. What data scope is allowed?
Even a permitted tool operation must be constrained by data classification and scope:
- public vs internal vs confidential vs PII
- row/column filters
- time-bounded access
- purpose limitation ("only for incident triage")
If the system can't express data scope at runtime, it can't claim governance.

6. What operations require human approval?
Some actions are inherently high risk:
- payments/refunds
- changing production configs
- emailing customers
- deleting data
- executing scripts
The policy should return "REQUIRE_APPROVAL" with clear obligations (what must be reviewed, what evidence is required, who can approve).
7. What actions are forbidden under certain risk conditions?
Risk-aware policy is the difference between governance and theater. Examples:
- If prompt injection signals are high → disable tool execution
- If session is anomalous → downgrade to read-only mode
- If data is PII + user not entitled → deny and redact
- If environment is prod + request is destructive → block regardless of model confidence

The key engineering takeaway

Governance works only when it's enforceable, runtime-evaluated, and capability-scoped:
- Agent identity answers: "Can it run at all?"
- Delegation answers: "Can it run for this user?"
- Capabilities answer: "Which operations exactly?"
- Data scope answers: "How much and what kind of data?"
- Risk gates + approvals answer: "When must it stop or escalate?"

If policy can't be enforced at runtime, it isn't governance. It's optimism.

Safe Execution Patterns

Policy answers whether something is allowed. Safe execution answers what happens when things get messy. Because they will. Models time out. Retries happen. Inputs are adversarial. People ask for the wrong thing. Agents misunderstand. And when tools can change state, small mistakes turn into real incidents. These patterns are what keep the system stable when the world is not. 👈

Two-phase execution
Do not execute directly from a model output. First phase: propose a plan and a dry-run summary of what will change. Second phase: execute only after policy gates pass, and approval is collected if required.

Idempotency for every write
If a tool call can create, refund, email, delete, or deploy, it must be safe to retry. Every write gets an idempotency key, and the gateway rejects duplicates. This one change prevents a huge class of production pain.

Default to read-only when risk rises
When injection signals spike, when the session looks anomalous, when retrieval looks suspicious, the system should not keep acting. It should downgrade. Retrieve, explain, and ask. No tool execution.

Scope permissions to operations, not tools
Tools are too broad. Do not allow Jira. Allow create ticket in these projects, with these fields. Do not allow database access. Allow read-only on this schema, with row and column filters.

Rate limits and blast radius caps
Agents should have a hard ceiling. Max tool calls per request. Max writes per session. Max affected entities. If the cap is hit, stop and escalate.

A kill switch that actually works
You need a way to disable tool execution across the fleet in one move. When an incident happens, you do not want to redeploy code. You want to stop the bleeding.

If you build these in early, you stop relying on luck. You make failure boring, contained, and recoverable. A small sketch of how a few of these patterns fit together in a tool gateway follows below.
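As an illustration only (names, operations, and thresholds are hypothetical, not from any specific product), a tool gateway that combines operation-scoped permissions, a policy decision before every write, idempotency keys, and a read-only safe mode might look roughly like this:

```python
import uuid
from dataclasses import dataclass, field

# Illustration only: a tool gateway enforcing operation-scoped permissions,
# a policy decision before every write, idempotency keys so retries cannot
# duplicate side effects, and a read-only "safe mode" under high risk.

@dataclass
class Decision:
    allowed: bool
    reason: str
    requires_approval: bool = False

@dataclass
class ToolGateway:
    granted_operations: set          # e.g. {"jira.create_ticket", "db.read"}, never whole tools
    risk_threshold: float = 0.7
    _seen_keys: set = field(default_factory=set)

    def authorize(self, operation: str, is_write: bool, session_risk: float) -> Decision:
        if operation not in self.granted_operations:
            return Decision(False, f"{operation} is not granted to this agent identity")
        if is_write and session_risk >= self.risk_threshold:
            return Decision(False, "risk is high; runtime downgraded to read-only mode")
        if is_write and operation.startswith(("payments.", "prod.")):
            return Decision(True, "high-impact write", requires_approval=True)
        return Decision(True, "allowed within scope")

    def execute_write(self, operation: str, payload: dict, idempotency_key: str, session_risk: float):
        decision = self.authorize(operation, is_write=True, session_risk=session_risk)
        if not decision.allowed or decision.requires_approval:
            return {"status": "blocked_or_pending_approval", "reason": decision.reason}
        if idempotency_key in self._seen_keys:        # a retry of a write we already performed
            return {"status": "duplicate_ignored"}
        self._seen_keys.add(idempotency_key)
        # ...perform the real side effect here, and log the decision for the audit trail...
        return {"status": "executed", "operation": operation}

gateway = ToolGateway(granted_operations={"jira.create_ticket", "db.read"})
key = str(uuid.uuid4())
print(gateway.execute_write("jira.create_ticket", {"title": "Pager noise"}, key, session_risk=0.2))
print(gateway.execute_write("jira.create_ticket", {"title": "Pager noise"}, key, session_risk=0.2))
```

The exact shape matters less than where it runs: in the execution path, before the side effect, with the decision logged.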
Think for scale, in the Era of AI for AI

I want to zoom out for a second, because this is the shift most teams still design around. We are not just adding AI to a product. We are entering a phase where parts of the system can maintain and improve themselves. Not in a magical way. In a practical, engineering way.

A self-improving system is one that can watch what is happening in production, spot a class of problems, propose changes, test them, and ship them safely, while leaving a clear trail behind it. It can improve code paths, adjust prompts, refine retrieval rules, update tests, and tighten policies. Over time, the system becomes less dependent on hero debugging at 2 a.m.

What makes this real is the loop, not the model. Signals come in from logs, traces, incidents, drift metrics, and quality checks. The system turns those signals into a scoped plan. Then it passes through gates: policy and permissions, safe scope, testing, and controlled rollout. If something looks wrong, it stops, downgrades to read-only, or asks for approval.

This is why scale changes. In the old world, scale meant more users and more traffic. In the AI for AI world, scale also means more autonomy. One request can trigger many tool calls. One workflow can spawn sub-agents. One bad signal can cause retries and cascades. So the question is not only can your system handle load. The question is can your system handle multiplication without losing control.

If you want self-improving behavior, you need three things to be true:
- The system is allowed to change only what it can prove is safe to change.
- Every change is testable and reversible.
- Every action is traceable, so you can replay why it happened.

When those conditions exist, self-improvement becomes an advantage. When they do not, self-improvement becomes automated risk. And this leads straight into governance, because in this era governance is not a document. It is the gate that decides what the system is allowed to improve, and under which conditions.

Observability: uptime isn't enough — you need traceability and causality

Traditional observability answers: Is the service up. Is it fast. Is it erroring. That is table stakes. Production AI needs a deeper truth: why did it do that. Because the system can look perfectly healthy while still making the wrong decision. Latency is fine. Error rate is fine. Dashboards are green. And the output is still harmful.

To debug that kind of failure, you need causality you can replay and audit:

Input → context retrieval → prompt assembly → model response → tool invocation → final outcome

Without this chain, incident response becomes guesswork. People argue about prompts, blame the model, and ship small patches that do not address the real cause. Then the same issue comes back under a different prompt, a different document, or a slightly different user context.

The practical goal is simple. Every high-impact action should have a story you can reconstruct later. What did the system see. What did it pull. What did it decide. What did it touch. And which policy allowed it. When you have that, you stop chasing symptoms. You can fix the actual failure point, and you can detect drift before users do.

RAG Governance and Data Provenance

Most teams treat retrieval as a quality feature. In production, retrieval is a security boundary. Because the moment a document enters the context window, it becomes part of the system's brain for that request. If retrieval pulls the wrong thing, the model can behave perfectly and still lead you to a bad outcome.

I learned this the hard way. I have seen systems where the model was not the problem at all. The problem was a single stale runbook that looked official, ranked high, and quietly took over the decision. Everything downstream was clean. The agent followed instructions, called the right tools, and still caused damage because the truth it was given was wrong.

I keep repeating one line in reviews, and I mean it every time:

Retrieval is where truth enters the system. If you do not control that, you are not governing anything. - Hazem Ali

So what makes retrieval safe enough for enterprise use?

Provenance on every chunk
Every retrieved snippet needs a label you can defend later: source, owner, timestamp, and classification. If you cannot answer where it came from, you cannot trust it for actions.

Staleness budgets
Old truth is a real risk.
A runbook from last quarter can be more dangerous than no runbook at all. If content is older than a threshold, the system should say it is old, and either confirm or downgrade to read-only. No silent reliance. Allowlisted sources per task Not all sources are valid for all jobs. Incident response might allow internal runbooks. Customer messaging might require approved templates only. Make this explicit. Retrieval should not behave like a free-for-all search engine. Scope and redaction before the model sees it Row and column limits, PII filtering, secret stripping, tenant boundaries. Do it before prompt assembly, not after the model has already seen the data. Citation requirement for high-impact steps If the system is about to take a high-impact action, it should be able to point to the sources that justified it. If it cannot, it should stop and ask. That one rule prevents a lot of confident nonsense. Monitor retrieval like a production dependency Track which sources are being used, which ones cause incidents, and where drift is coming from. Retrieval quality is not static. Content changes. Permissions change. Rankings shift. Behavior follows. When you treat retrieval as governance, the system stops absorbing random truth. It consumes controlled truth, with ownership, freshness, and scope. That is what production needs. Security: API keys aren’t a strategy when agents can act The highest-impact AI incidents are usually not model hacks. They are architectural failures: over-privileged identities, blurred trust boundaries, unbounded tool access, and unsafe retrieval paths. Once an agent can call tools that mutate state, treat it like a privileged service, not a chatbot. Least privilege by default Explicit authorization boundaries Auditable actions Containment-first design Clear separation between user intent and system authority This is how you prevent a prompt injection from turning into a system-level breach. If you want the deeper blueprint and the concrete patterns for securing agents in practice, I wrote a full breakdown here: Zero-Trust Agent Architecture: How to Actually Secure Your Agents What “production-ready AI” actually means Production-ready AI is not defined by a benchmark score. It’s defined by survivability under uncertainty. A production-grade AI system can: Explain itself with traceability. Enforce policy at runtime. Contain blast radius when wrong. Degrade safely under uncertainty. Recover with clear operational playbooks. If your system can’t answer “how does it fail?” you don’t have production AI yet.. You have a prototype with unmanaged risk. How Azure helps you engineer production-grade AI Azure doesn’t “solve” production-ready AI by itself, it gives you the primitives to engineer it correctly. The difference between a prototype and a survivable system is whether you translate those primitives into runtime control points: identity, policy enforcement, telemetry, and containment. 1. Identity-first execution (kill credential sprawl, shrink blast radius) A production AI runtime should not run on shared API keys or long-lived secrets. In Azure environments, the most important mindset shift is: every agent/workflow must have an identity and that identity must be scoped. Guidance Give each agent/orchestrator a dedicated identity (least privilege by default). Separate identities by environment (prod vs staging) and by capability (read vs write). 
Security: API keys aren’t a strategy when agents can act

The highest-impact AI incidents are usually not model hacks. They are architectural failures: over-privileged identities, blurred trust boundaries, unbounded tool access, and unsafe retrieval paths. Once an agent can call tools that mutate state, treat it like a privileged service, not a chatbot.

- Least privilege by default
- Explicit authorization boundaries
- Auditable actions
- Containment-first design
- Clear separation between user intent and system authority

This is how you prevent a prompt injection from turning into a system-level breach.

If you want the deeper blueprint and the concrete patterns for securing agents in practice, I wrote a full breakdown here: Zero-Trust Agent Architecture: How to Actually Secure Your Agents

What “production-ready AI” actually means

Production-ready AI is not defined by a benchmark score. It’s defined by survivability under uncertainty.

A production-grade AI system can:

- Explain itself with traceability.
- Enforce policy at runtime.
- Contain blast radius when wrong.
- Degrade safely under uncertainty.
- Recover with clear operational playbooks.

If your system can’t answer “how does it fail?”, you don’t have production AI yet. You have a prototype with unmanaged risk.

How Azure helps you engineer production-grade AI

Azure doesn’t “solve” production-ready AI by itself; it gives you the primitives to engineer it correctly. The difference between a prototype and a survivable system is whether you translate those primitives into runtime control points: identity, policy enforcement, telemetry, and containment.

1. Identity-first execution (kill credential sprawl, shrink blast radius)

A production AI runtime should not run on shared API keys or long-lived secrets. In Azure environments, the most important mindset shift is: every agent and workflow must have an identity, and that identity must be scoped.

Guidance

- Give each agent/orchestrator a dedicated identity (least privilege by default).
- Separate identities by environment (prod vs staging) and by capability (read vs write).
- Treat tool invocation as a privileged service call, never “just a function.”

Why this matters

If an agent is compromised (or tricked via prompt injection), identity boundaries decide whether it can read one table or take down a whole environment.

2. Policy as enforcement (move governance into the execution path)

This article’s core idea, that governance is runtime enforcement, maps perfectly to Azure’s broader governance philosophy: policies must be enforceable, not advisory.

Guidance

- Create an explicit Policy Enforcement Point (PEP) in your agent runtime.
- Make the PEP decision mandatory before executing any tool call or data access.
- Use “allow + obligations” patterns: allow only with constraints (redaction, read-only mode, rate limits, approval gates, extra logging).

Why this matters

Governance fails when it’s a document. It works when it’s compiled into runtime decisions.
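Here is a minimal sketch of what an explicit PEP with allow-plus-obligations can look like inside the agent runtime. The scope names, risk thresholds, and obligation strings are assumptions chosen for the example, not an Azure Policy or Microsoft Entra API. The structural point is that the decision happens before the tool executes, and that an allow can still constrain how the action runs.

```python
# Minimal Policy Enforcement Point sketch: every tool call passes through decide() before
# it executes, and an "allow" can carry obligations (read-only, redaction, rate limits).
# The rules, scopes, and class names here are illustrative assumptions, not a service API;
# in production the decision and its policy version are also written to the audit trace.
from dataclasses import dataclass, field

@dataclass
class Decision:
    effect: str                      # "allow" | "deny" | "require_approval"
    obligations: list[str] = field(default_factory=list)
    policy_version: str = "pep-v1"

class PolicyEnforcementPoint:
    def __init__(self, scopes_by_identity: dict[str, set[str]], high_risk_tools: set[str]):
        self.scopes_by_identity = scopes_by_identity
        self.high_risk_tools = high_risk_tools

    def decide(self, identity: str, tool: str, scope: str, risk_score: float) -> Decision:
        granted = self.scopes_by_identity.get(identity, set())
        # Hard boundary: the identity was never granted this scope.
        if scope not in granted:
            return Decision("deny")
        # High-risk tools or elevated risk signals require a human in the loop.
        if tool in self.high_risk_tools or risk_score >= 0.8:
            return Decision("require_approval", obligations=["log_full_trace"])
        # Moderate risk: allow, but with obligations the runtime must apply.
        if risk_score >= 0.5:
            return Decision("allow", obligations=["read_only", "redact_pii", "rate_limit"])
        return Decision("allow", obligations=["log_decision"])

# Usage: the agent runtime calls the PEP before invoking any tool.
pep = PolicyEnforcementPoint(
    scopes_by_identity={"agent-incident-prod": {"read:runbooks", "write:tickets"}},
    high_risk_tools={"payments.refund", "infra.delete_resource"},
)
print(pep.decide("agent-incident-prod", "jira.create_ticket", "write:tickets", risk_score=0.3))
print(pep.decide("agent-incident-prod", "payments.refund", "write:payments", risk_score=0.2))
```

The second call is denied even though the risk score is low, because the identity was never granted a payments scope; that is the "least privilege decides the blast radius" idea expressed as a runtime check rather than a policy document.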
3. Observability that explains behavior

Azure’s telemetry stack is valuable because it’s designed for distributed systems: correlation, tracing, and unified logs. Production AI needs the same, plus decision traceability.

Guidance

- Emit a trace for every request across: retrieval → prompt assembly → model call → tool calls → outcome.
- Log policy decisions (allow/deny/require approval) with the policy version and obligations applied.
- Capture “why” signals: risk score, classifier outputs, injection signals, uncertainty indicators.

Why this matters

When incidents happen, you don’t just debug latency — you debug behavior. Without causality, you can’t root-cause drift or containment failures.

4. Zero-trust boundaries for tools and data

Azure environments tend to be strong at network segmentation and access control. That foundation is exactly what AI systems need, because AI introduces adversarial inputs by default.

Guidance

- Put a Tool Gateway in front of tools (Jira, email, payments, infra) and enforce scopes there.
- Restrict data access by classification (PII/secret zones) and enforce row/column constraints.
- Degrade safely: if risk is high, drop to read-only, disable tools, or require approval.

Why this matters

Prompt injection doesn’t become catastrophic when your system has hard boundaries and graceful failure modes.

5. Practical “production-ready” checklist (Azure-aligned, engineering-first)

If you want a concrete way to apply this:

- Identity: every runtime has a scoped identity; no shared secrets
- PEP: every tool/data action is gated by policy, with obligations
- Traceability: full chain captured and correlated end-to-end
- Containment: safe degradation + approval gates for high-risk actions
- Auditability: policy versions and decision logs are immutable and replayable
- Environment separation: prod ≠ staging identities, tools, and permissions

Outcome

This is how you turn “we integrated AI” into “we operate AI safely at scale.”

Operating Production AI

A lot of teams build the architecture and still struggle, because production is not a diagram. It is a living system. So here is the operating model I look for when I want to trust an AI runtime in production.

The few SLOs that actually matter

Trace completeness
For high-impact requests, can we reconstruct the full chain every time, without missing steps?

Policy coverage
What percentage of tool calls and sensitive reads pass through the policy gate, with a recorded decision?

Action correctness
Not model accuracy. Real-world correctness. Did the system take the right action, on the right target, with the right scope?

Time to contain
When something goes wrong, how fast can we stop tool execution, downgrade to read-only, or isolate a capability?

Drift detection time
How quickly do we notice behavioral drift before users do?

The runbooks you must have

If you operate agents, you need simple playbooks for predictable bad days:

- Injection spike → safe mode, block tool execution, force approvals
- Retrieval poisoning suspicion → restrict sources, raise freshness requirements, require citations
- Retry storm → enforce idempotency, rate limits, and circuit breakers
- Tool gateway instability → fail closed for writes, degrade safely for reads
- Model outage → fall back to deterministic paths, templates, or human escalation

Clear ownership

Someone has to own the runtime, not just the prompts.

- Platform owns the gates, tool gateway, audit, and tracing
- Product owns workflows and user-facing behavior
- Security owns policy rules, high-risk approvals, and incident procedures

When these pieces are real, production becomes manageable. When they are not, you rely on luck and hero debugging.

The 60-second production readiness checklist

If you want a fast sanity check, here it is.

- Every agent has an identity, scoped per environment
- No shared API keys for privileged actions
- Every tool call goes through a policy gate with a logged decision
- Permissions are scoped to operations, not whole tools
- Writes are idempotent, so retries cannot duplicate side effects
- Tool gateway validates inputs, scopes data, and rate-limits actions
- There is a safe mode that disables tools under risk
- There is a kill switch that stops tool execution across the fleet
- Retrieval is allowlisted, provenance-tagged, and freshness-aware
- High-impact actions require citations or they stop and ask
- Audit logs are immutable enough to trust later
- Traces are replayable end-to-end for any incident

If most of these are missing, you do not have production AI yet. You have a prototype with unmanaged risk.

A quick note

In Azure-based enterprises, you already have strong primitives that mirror the mindset production AI requires: identity-first access control (Microsoft Entra ID), secure workload authentication patterns (managed identities), and deep telemetry foundations (Azure Monitor / Application Insights). The key is translating that discipline into the AI runtime so governance, identity, and observability aren’t external add-ons, but part of how AI executes and acts.

Closing

Models will keep evolving. Tooling will keep improving. But enterprise AI success still comes down to systems engineering.

If you’re building production AI today, what has been the hardest part in your environment: governance, observability, security boundaries, or operational reliability?

If you’re dealing with deep technical challenges around production AI, agent security, RAG governance, or operational reliability, feel free to connect with me on LinkedIn. I’m open to technical discussions and architecture reviews.

Thanks for reading.

— Hazem Ali