llm
7 TopicsThe Hidden Architecture of Nano Architectures
Why does the same prompt, on the same checkpoint, with temperature set to zero, sometimes produce a different answer only when the system is under real load? If you have ever watched token three flip and then watched the whole completion diverge, you already know this is not a product bug. It is a systems fact. Here is the thing. In production, you did not deploy a model. You deployed a runtime that selects an execution plan under constraints. The weights are inside that plan. The behavior is the plan. I’m Hazem Ali — Microsoft AI MVP, Distinguished AI and ML Engineer and Architect, and Founder and CEO of Skytells. I’ve built and led engineering work that turns deep learning research into production systems that survive real-world constraints. I speak at major conferences and technical communities, and I regularly deliver deep technical sessions on enterprise AI and agent architectures. If there’s one thing you’ll notice about me, it’s that I’m drawn to the deepest layers of engineering, the parts most teams only discover when systems are under real pressure. My specialization spans the full AI stack, from deep learning and system design to enterprise architecture and security. A rule I repeat in every serious review is simple. If you cannot explain the runtime, you do not understand the model you deployed. — Hazem Ali This is the next layer after my earlier deep dive on memory, KV cache, paging, and trust boundaries in The Hidden Memory Architecture of LLMs I also break down the memory-and-paging failure modes in When Your LLM Trips the MMU This one goes lower, into the execution that decides which math actually runs. When I Had to Prove It Live I still remember the first time I had to make this concrete in front of a room full of engineers. It was during a technical session I gave, and the question came up in the exact form you’ve probably heard before: Why does the same prompt on the same checkpoint, with temperature set to zero, sometimes produce a different answer only under real load? So I answered it the only way that holds up in a serious engineering room. I didn’t frame it as randomness. I framed it as execution. Not because it sounds cleaner, but because it is the only framing that survives scrutiny: under load, the system is not evaluating the same computation. In production, you don’t deploy weights in isolation. You deploy a runtime that selects an execution plan under constraints. Under load, the constraints change at token cadence: microbatch membership shifts, shapes shift, workspace feasibility tightens, and kernels or algorithms that were legal in the calm regime can become infeasible in the pressured regime. The runtime stays correct by contract, but it executes a different plan. And once the executed plan changes, reduction staging can change. When reduction staging changes, rounding happens at different points. That can move last bits. In decoding, last bits can become different tokens when early logit margins are thin. After the first token flips, divergence is expected because the context is different. That’s what I mean throughout this article when I say: The weights are inside the plan, but the behavior is the plan. What is Happening in Runtime Let’s start with the part most teams skip: the runtime pipeline from admission to a token. A production LLM server is not a function call. It is a control plane. And under real load, it behaves like one. It is not asking “what does the model say.” It is asking “what can I execute right now without breaking my guarantees.” Right now matters. Not in theory, in milliseconds. Because every decode step is a new scheduling event. The system does not commit to a single plan for the entire completion. It keeps re-evaluating feasibility as state shifts. What can I execute at this moment, with the VRAM I still have, on the hardware state I am currently in, while staying inside isolation boundaries and latency targets. That question is not answered once per request. It is answered repeatedly, at token cadence. The queue changes. The batch changes. Memory headroom changes. Cache residency changes. Workspace availability changes. The set of legal kernel and algorithm choices changes with them. And that is the point most people miss. The runtime is not just running your weights. It is continuously selecting an execution plan under constraint. The weights are inside that plan, but behavior lives in the selection. That selection is layered. Admission shapes the effective request. Scheduling forms the batch for this step. Kernel and algorithm choice binds the math that will actually run. Memory residency and allocation decide what is feasible. Isolation rules decide what sharing is allowed. Each layer contributes to the final plan, and the plan is what you are deploying. Admission and shaping Before your prompt ever reaches the model, it gets shaped. Truncation, policy injection, tool schema expansion, routing metadata, tenant tags, prefix reuse decisions, and safety transformations. If you do not know what I mean by effective request, I mean the exact token sequence that the model saw after shaping. That is the only input that matters for reproducibility. Batching and step level scheduling Modern servers do not just batch requests. They batch token steps. In a continuous batching system, token step timing feeds back into batching decisions. A slightly slower step changes who joins the next step. Who joins the next step changes shapes. Shapes change kernels. Kernels change numeric pathways. This is not an opinion. It is why vLLM exists. The PagedAttention paper describes serving as a batching problem where KV cache grows dynamically, wastes memory through fragmentation, and limits batch size. It introduces block level KV management and builds vLLM on top of it as an LLM serving system. Kernel plan selection and library behavior Once shapes are known, the runtime selects kernel variants and library algorithms that are feasible for those shapes and the workspace currently available. This is the part people underestimate. The same operator can have multiple valid implementations. The chosen implementation can change when workspace is tight, when shapes change, or when the engine wants to trade latency for throughput. Memory allocation and residency KV cache, activations, temporary buffers, workspace, graph memory, and communication buffers compete for VRAM. Under pressure, allocation patterns change. Fragmentation changes. Residency changes. Cache locality changes. All of that changes the system timeline and the feasible plan space. If you want a one line summary that is accurate in 2026 production inference, it is this. Inference is a scheduling problem plus a memory residency problem, and the model is inside that. The Scope First, Let me put it very clear. I am not claiming every deployment is nondeterministic. I am not claiming every kernel variant flips tokens. I am not claiming seeds are useless. I am making a narrower claim, the kind you can defend in an incident review without hand waving. Floating point math is not associative. Order matters. When you parallelize, you change the order of operations, and it is therefore valid for parallel results to differ from a sequential evaluation. NVIDIA states this directly in the CUDA C Best Practices Guide. CUDA also makes a foundational guarantee to the hardware and scheduler, not to your intuition. Thread blocks must be able to execute independently, in any order, in parallel or in series. That freedom is part of the programming model, not an edge case (ref). Now connect those two facts. If accumulation order changes, the last bits can change even when every operation is correct, because floating point addition is not associative. NVIDIA explicitly calls this out as well. Then layer in what serving stacks actually do. Production systems intentionally reshape execution through continuous batching and KV memory management. vLLM is a published example of this co design, where serving throughput is achieved by dynamic batching and memory-aware KV handling. Finally, bridge the nano to the semantic. When early logit margins are small, tiny numeric deltas can reorder the top candidates, and a single token flip is enough to diverge the entire completion. Here is the part that should feel a little scary, because it changes what you think you are operating. Under real load, the system is not just slower. It can enter a different execution regime. Batch composition shifts, shapes shift, workspace and residency shift, and the runtime is forced into a different set of legal kernel and algorithm choices. Nothing “breaks.” No bug is required. The system is still correct by contract. But your output is now a property of the regime you are in, not the demo you validated. That means you can pass every determinism test at idle and still ship a system that drifts only when it matters, at p95 and p99, when queues are long and memory headroom is tight. The first time you notice is often a user screenshot, an audit question, or an incident report where two replicas disagree on the same request because the runtime state was not the same. The equation principals should use in incident reviews Most teams ship with the demo mental model. y = f(x, θ) One prompt in, one checkpoint, one output. If the output changes, someone concludes the weights changed, or “AI is random.” That is not how production inference behaves, because production inference is not just a function. It is execution under constraint. Production behavior is closer to this. y = Decode( Exec(θ, x; s) ) θ is still the same weights. But the thing you actually shipped is Exec, and Exec is chosen. It is chosen per step, under the current state of the system. The behavior you observe is the behavior of the executed plan, not the abstract weights. X is not the prompt. X is the effective request. X is the exact token sequence the model saw after shaping. Truncation, policy injection, tool schema expansion, routing metadata, prefix reuse, safety transforms. All of that can change what the model actually receives. If you cannot reconstruct x, you are not replaying the request. You are replaying an approximation. Here is the minimum you should log for x, even if you cannot store raw text: # minimal "x" record: enough to reproduce or prove you cannot trace_x = { "req_id": req_id, "raw_prompt_sha256": sha256(raw_prompt), "effective_text_sha256": sha256(effective_text), "effective_tokens": len(effective_tokens), "truncated": truncated, "trunc_reason": trunc_reason, # e.g., "latency_guard", "context_cap" "decode_cfg_applied": decode_cfg, # temperature/top_p/max_tokens, etc. "shaping_events": events, # ["policy_inject:v3", "tool_schema:v2", ...] } S is not a vibe. S is the execution state that decides the math. S is what principals should demand in a postmortem, because this is what turns “it drifted” into “this plan executed under this regime.” At minimum, s includes: per-step batch composition and shape class queue delays and scheduling outcomes VRAM headroom and workspace availability cache pressure signals precision path and engine fallbacks distributed timeline signals (TP/PP latency, collective stalls) isolation posture (what batching is allowed) Why this matters: in continuous batching, time becomes part of semantics. A few milliseconds of delay changes who gets co-scheduled at the next token step. That changes shapes. Shapes change kernel/algorithm feasibility. Feasibility changes the numeric pathway. When early logit margins are thin, a tiny pathway delta is enough to flip the argmax. Here is a short, practical “s” record you can emit per decode step: # per-step "s" record: what plan ran, under what pressure step_s = { "req_id": req_id, "step": t, "batch_fp": sha256(",".join(sorted(batch_req_ids)))[:12], "shape": f"q=1,k={klen},h={heads},d={hidden},tp={tp}", "queue_ms": queue_ms, "gpu_ms": gpu_ms, "vram_free_mb": vram_free_mb, "workspace_free_mb": workspace_free_mb, "kv_regime": kv_regime, # "normal" | "pressured" | "paged" "precision_path": precision_path, # "bf16" | "fp16" | "tf32" | "fp32" "algo_id": algo_id, # backend/engine specific "kernel_variant": kernel_variant, # if available "isolation_mode": isolation_mode, # "shared" | "strict" } The incident-review translation If you only ask “what prompt did the user send” and “what weights did we run,” you are using the demo equation. You will argue about seeds, debate “randomness,” and never converge. The production equation forces the real question. Which plan executed, under which constraints, and what state pushed us into that plan. The line principals should repeat until teams internalize it is simple. Weights are static. Behavior is a property of the executed plan. And the executed plan depends on state. If you want one more operational layer that makes this feel real, add a regime marker. Regime changes are where “stability” collapses without any bug: def regime(vram_free_mb, paging_on, isolation_strict, queue_p95_ms): if isolation_strict: return "isolation_strict" if paging_on: return "paging" if vram_free_mb < 1024: return "memory_pressured" if queue_p95_ms > 50: return "queue_degraded" return "normal" When the regime changes, the feasible plan space changes. When the plan space changes, the executed math can change. That is the production reality your incident review must be able to explain. Floating point order is where small deltas are born Let’s break it down without hand waving. Finite precision makes rounding part of the computation Floating point math is not real-number math. Every add and multiply is followed by rounding to the representable format you are using. That rounding is not “noise.” It is part of the computation. Once you accept that, one consequence becomes unavoidable. Order matters. NVIDIA states the rule clearly: floating point involves rounding, and when you parallelize you can change operation order, so parallel results may not match sequential results. Why LLM inference is a perfect storm: reductions everywhere Now connect that to what an LLM does at inference time. LLM inference is reduction-heavy by design. Dot products in GEMMs, attention score accumulation, softmax normalization, layer norm statistics, even top-k selection pathways. These are not single operations. They are many partial operations combined into a final scalar or vector. In floating point, the way you combine partials is the outcome. GPU reductions are staged: partial sums, then merges A reduction on GPU is not “a sum.” It is a staged reduction of partials. On a CPU, you can imagine a left-to-right accumulation: ((((a1 + a2) + a3) + a4) + ...) On a GPU, that mental model is wrong. The GPU is built to run thousands of threads. So it computes partial sums in parallel and then merges them in stages. The staging pattern is determined by kernel design and how the backend maps the problem to hardware. Put the figure here, right after the staging idea lands. The staging depends on decisions you do not control at the prompt layer: how data is tiled into blocks how each block maps to warps how many partials each warp reduces whether it uses warp-level primitives, shared memory, or tensor core fragments how the final merge is staged across blocks Change the tile size, or the block shape, or the occupancy, and you often change the staging order. Change the staging order, and you change when rounding happens. You can get two results that are both correct under IEEE floating point rules, and they differ in the last bits. This is not a bug. It is the contract of finite-precision parallel math, applied at scale. Why the last bits move at the core level Floating point addition is not associative under rounding because rounding happens after each operation. The error introduced at each step depends on the magnitude and sign of what you are adding at that step. When you change the staging order, you change: which numbers get added together early which partial sums get rounded early how cancellation behaves when positive and negative terms interact when large and small magnitudes meet, where small values can lose representable impact That is the core mechanism behind “small deltas.” It is not mystical. It is mechanical. Why this shows up in production serving, not in your demo LLM inference is dominated by massive matrix operations and attention. Under the hood, those paths accumulate across large dimensions. An accumulation is exactly where rounding order matters most. And the server does not always run the same kernel variant for those ops. Under load, shape shifts and workspace pressure can push the backend into different implementations. Different implementations often imply different tiling. Different tiling implies different staging. Different staging implies different rounding. Different rounding implies different last bits. So even with an identical prompt, identical checkpoint, and temperature set to zero, you can still see tiny numeric differences when: batch composition changes and produces different effective shapes the engine picks a different algorithm because workspace is tighter the kernel selects a different tile path due to shape class and occupancy the GPU is in a different pressure regime, changing feasibility and scheduling behavior Those deltas are small, but they are real. And in decoding, small can be enough. The bridge from ulps to language: logits, argmax, divergence A tiny last-bit difference is often irrelevant, Until it hits a decision boundary. At decode step t, greedy decoding chooses an argmax. If the top logits are close, a small delta can swap the ordering. Once token t changes, the context changes, and the completion diverges. That is not randomness. That is deterministic branching from a slightly different numerical pathway. So the actionable takeaway is not “GPUs are nondeterministic.” It is this. Parallel math is allowed to produce multiple correct last-bit outcomes, and LLM decoding can amplify those outcomes into different text when margins are thin. CUDA scheduling makes ordering a form of runtime state CUDA makes a stronger statement than most people realize. Thread blocks must be able to run independently. It must be possible to execute blocks in any order, in parallel or in series. That is why the same kernel can execute with different inter block ordering depending on occupancy, contention, and scheduling. Now bring atomics into the picture. Atomics guarantee correctness of each update. They do not guarantee the arrival order of updates across threads and blocks. When floating point updates arrive in different legal orders, the final sum can differ in the last bits, because floating point addition is not associative. If you do not know what atomic add means, here is the useful definition. Atomic add ensures updates do not overwrite each other. It does not ensure which thread gets there first. This is the nano architecture layer that explains a lot of weirdness. Many engineers assume determinism is a property of weights. In practice, determinism is constrained by the legal reorderings of parallel execution. Logit margin is the bridge from ulps to language Now we connect the last bits to a changed sentence. At decode step t, greedy decoding picks the argmax over logits. Let the top two logits be ℓₐ and ℓ_b. Define the margin: mₜ = ℓₐ − ℓ_b A token flip happens when a small perturbation changes the ordering of these top two. If you want an operational translation, it is this. If the model barely prefers token A over token B, a tiny numeric delta can make it prefer B. Once token t changes, the rest of the completion evolves under a different context. Divergence is expected. This is why I keep pushing one instrumentation idea that sounds boring until you need it. Measure early step margins. You cannot manage stability if you never measure how close the decision boundary is. The effective request problem, the quiet killer of reproducibility Here is the pattern I see in almost every serious production investigation. The team replays the user prompt, cannot reproduce the output, and concludes the model is nondeterministic. Then the incident dies in ambiguity. And then, usually too late, someone asks the only question that matters. What did the model actually see. “In every postmortem, I ask one question before I look at weights, kernels, or seeds: what did the model actually see. If we cannot answer that, nothing else is evidence.” - Hazem Ali In production, the user prompt is not the input. It is an ingredient. By the time a request reaches the model, it has passed through a shaping pipeline that exists to keep the system safe, fast, and multi-tenant. That pipeline is not cosmetic. It can change semantics, length, and even decode behavior. The result is the only input that matters for reproducibility. The effective request. This is the same thesis you have already accepted earlier in the article. y = Decode( Exec(θ, x; s) ) If you do not know x, your replay is not valid. If you do not know s, your replay is not comparable. And if you only log the raw prompt, you are logging neither. Shaping changes semantics, not just length Truncation is the obvious one. Under load, systems often cap context length to protect latency and GPU memory. Same prompt, different truncation boundary, different effective context, different output. Nothing “random” happened. You executed a different input. But truncation is only the beginning. Policy injection can prepend or append system text that changes intent. Tool schema expansion can add hundreds or thousands of tokens and push the request over a context boundary. Routing metadata can select a different template. Prefix caching can reconstruct parts of context from cached state rather than raw text. Safety transformations can rewrite or neutralize content. Even small differences here can shift early logits when margins are thin, and this article already showed how small deltas become different tokens. The worst part is that this is silent by default. The user sees their prompt. Engineers see the prompt in logs. The model sees a different token sequence. Then everyone argues about reproducibility using the wrong input. Why this interacts with load, not just correctness Under low load, your system often has enough headroom to be generous. Longer context, fewer cutoffs, stable routing, more consistent batching, and fewer fallbacks. Under real load, shaping becomes defensive. Dynamic truncation thresholds kick in. Tool schema expansions collide with context limits. Prefix reuse behavior changes. Safety gates can become stricter. The same user text can produce a different effective request, and therefore a different output, precisely when the system is under pressure. So if you are only validating reproducibility at idle, you are validating a different system than the one you ship. What principals should require in telemetry If you want strict reproducibility, you must log the execution contract per request. Not the story. The contract. At minimum: effective token count after shaping truncation boundary and reason final merged decode config actually applied policy gates that modified prompt or decode path whether prefix cache was used, and what cache key was referenced routing template version and system message hash If you are privacy constrained, you still can log hashes and structural facts. You do not need raw prompts to diagnose effective request drift. You need verifiable fingerprints. Here is the short version in one line. If you only log the user prompt, you have not logged x. You have logged an approximation of x. And without x, you cannot claim reproducibility. You can only hope for it. Continuous batching, why time becomes part of semantics This is where principal level thinking matters. Continuous batching does not just increase throughput. It changes the execution context at each token step. Batch composition changes shapes. Shapes influence kernel selection and workspace feasibility. Those choices can change reduction structure and rounding pathways. If you want a published anchor, use vLLM. The PagedAttention paper frames high throughput serving as a need to batch many requests, but KV cache grows dynamically and wastes memory through fragmentation. It proposes PagedAttention and builds vLLM on top of it, with block level memory management and flexible sharing of KV cache to reduce memory usage. (arxiv) Here is what this really means in production. The server is selecting which requests share a step. That changes the math shapes. That changes the executed plan. That is why the same prompt behaves differently under load even at temperature zero. Algorithm selection and engine fallback The hidden variability people forget about If you have ever tried to reproduce a drift across replicas and felt like you were chasing ghosts, this is usually the layer you were missing. Libraries and engines choose, Not in a philosophical sense. In a literal, per-operator, per-shape sense. The same attention call is a fork in the road between multiple legal tactics, each with different tiling, different reduction staging, different fusion boundaries, and different temporary memory requirements. Your checkpoint is the same, your prompt is the same, your temperature is zero, and the output still moves because the executed plan moved. PyTorch says the quiet part directly. Disabling cuDNN benchmarking makes cuDNN deterministically select an algorithm, and PyTorch stresses this is different from the deterministic setting. That is the whole story in one sentence: one switch affects how the backend selects an algorithm, another affects whether the selected algorithms are deterministic. Those are separate layers, and under load they can diverge. Now go down to the core of the core. A tactic is not fast or slow. In production serving, a tactic is legal or illegal under the constraints of this token step. The constraint that forces most plan switches is not compute. It is workspace feasibility. Many high-performance kernels need scratch buffers. Some need enough contiguous space to stage tiles, reorder operands, hold partials, or run fused epilogues. When VRAM is fragmented or headroom drops, a tactic becomes impossible even if it is the tactic you validated at idle. The engine does not throw a warning. It simply selects another legal tactic. That is the first uncomfortable point. The second uncomfortable point is what makes this align perfectly with the next section. The constraint is not only “how many MB are free.” The constraint is the memory hierarchy state of the chip. Under load, two replicas can have the same free VRAM and still be in a different regime because the chip is not one pool of memory. It is HBM plus an on-die L2, plus TLBs, plus page tables, plus a fabric that is arbitrating traffic between SMs, L2 slices, and HBM controllers. When that hierarchy shifts, latency per token step shifts. And in continuous batching, a few milliseconds is not a timing detail, it is a scheduling input. This is how a performance event becomes a behavior event without any bug. The engine’s planner sees a world where a tactic that was “best” at idle is no longer best, or no longer feasible, because the chip is in a different pressure state. Your runtime is still correct. It is just operating a different plan in a different regime. One op, multiple legal kernels. The chosen tactic depends on shape class and feasibility. Now bring TensorRT into the picture, because it makes the precision dimension explicit. TensorRT states TF32 Tensor Core usage is not guaranteed and it can fall back to FP32, and it documents configuration controls around TF32. That statement is not about “precision preference.” It is about the reality that precision is part of tactic selection. Precision changes which instructions execute and how accumulation is staged. When your early logit margins are thin, a small pathway delta can swap the argmax at one step. One token flips, and the rest of the completion deterministically diverges. So “temperature zero” is not a determinism guarantee. Temperature governs sampling. It does not pin the execution pathway. If you want a more mechanical anchor, treat matmul the way NVIDIA exposes it: cuBLASLt has a preference descriptor for applying algorithm search preferences and fine-tuning the heuristic function. That is not marketing. That is the API admitting that algorithm selection is a constrained search problem. Now the part that gets rare, and the part most teams never write down. CUDA’s programming model requires that thread blocks be able to execute independently and may execute in any order, in parallel or in series. This matters here because tactic switches often change block geometry and tiling. Different block geometry changes reduction staging. Reduction staging changes where rounding happens. Even if every operation is correct, last bits can move because you legally changed the staging of partial sums. You do not need randomness. You need a different legal staging tree. Now pull security into the same frame, because it is not a separate layer in production. Security posture changes what the scheduler is allowed to do. Isolation constraints reduce batching freedom. Reduced batching freedom increases tail latency. Tail latency pushes you toward tighter admission controls and more aggressive memory behavior. That shrinks the feasible tactic set sooner. In other words, security decisions can move you across regime boundaries faster, which increases plan switching frequency. Stability becomes an SLO dimension of your security posture, not a property of your weights. This is the business consequence that shows up in the worst possible way. So here is the operational rule I use in reviews. If you cannot prove which plan ran, you cannot claim reproducibility. And that leads to the only practical addition that belongs in this section before we move into VRAM bandwidth and cache residency. VRAM bandwidth, cache residency, and why memory hierarchy becomes control plane input Let’s talk about the performance facts that quietly become behavior facts. And yes, I know how complex this gets. I have watched strong staff and principal engineers get lost here, not because they are weak, but because the system crosses too many layers at once: GPU microarchitecture, allocator behavior, kernel tactics, batching policy, and SLO-driven control loops. No single dashboard shows you the full causal chain. That is exactly why I frame it this way. It is not “performance tuning.” It is a coupled control system. So let me break it down cleanly, from the chip outward, until the behavior change becomes inevitable. NVIDIA describes H100 SXM5 as having HBM3 bandwidth around 3 TB/s and an L2 cache of 50 MB designed to reduce trips to HBM by caching repeated accesses. Most teams read that as “the GPU is fast.” In serving, it is more precise to say: the GPU gives you a memory hierarchy with regimes, and your runtime is forced to adapt to whichever regime you are currently in. The chip-level model you should carry in your head Decode is not one big matmul. It is a loop that repeatedly touches a shifting working set: KV blocks for the active sequences attention metadata (block tables, indirection, masks) sampling buffers (logits, top-k/top-p structures) runtime bookkeeping for continuous batching Those accesses are not purely streaming. They are pointer-heavy, and their locality depends on how your KV is laid out, which requests are co-scheduled, and how fragmented your memory becomes under churn. Here is the simplest mental model that is still honest: B_HBM is the number of bytes actually read from HBM during this step. B_L2miss is the number of bytes that missed L2 and therefore had to be fetched from HBM. t_translate is the address-translation tax: extra time from TLB misses and page-table walks. That last term is the one that surprises people. It’s “invisible” until it dominates. Why L2 residency becomes a control-plane input Now connect that to decode, Decode repeatedly reads KV state. If L2 hit rate drops, HBM traffic rises. When HBM traffic rises, stalls rise. When stalls rise, token-step latency shifts. When token-step latency shifts, the server changes batching decisions. This is the control loop you should keep in your head: L2 hit rate ↓ → t_step ↑ → Δt ↑ → batch composition changes → shape class changes → tactic set changes That is the bridge from “cache miss” to “different plan executed.” In continuous batching, time is not just an output metric. Time is an input into the next scheduling decision. A few milliseconds can change who gets co-scheduled at the next token step. That changes shapes. Shapes change feasible kernels and algorithms. That changes the executed math. And if early logit margins are thin, a small pathway delta can flip a token and send the rest of the completion down a different branch. Rare but matters: the translation tax that breaks the “free VRAM” illusion Two replicas can report similar free VRAM and still be in different regimes. Why? Because the chip is not “a pool of memory.” It is an on-die cache, translation structures, page tables, and a fabric that is arbitrating traffic under pressure. When KV is stored in blocks (or pages) and those blocks are scattered due to churn, you often get: worse spatial locality more distinct memory regions per step more TLB pressure more page walks Page walks are not abstract. They are memory reads. They compete with your payload reads. Under real load, this turns into self-inflicted HBM traffic. So you can be “bandwidth rich” on paper and still be “latency poor” in practice because the working set became translation-hostile. This is how a performance event becomes a behavior event without any bug. A concrete KV bandwidth sanity check If you want a back-of-the-envelope check for why decode becomes memory-shaped, use a conservative estimate. Per token step, you often need to read a large portion of KV for the active context. A rough model is: KV bytes per step ≈ 2 × B × L × H × D × s Where: B is batch size (number of sequences co-scheduled in the step) L is current context length (tokens already in KV) H is the number of attention heads (or KV heads, depending on the model) D is head dimension s is bytes per element (2 for fp16/bf16, 1 for int8, etc.) The factor 2 accounts for K and V. Even if your kernel is compute-efficient, you are still moving a lot of bytes. If locality collapses and L2 misses rise, you shift into an HBM-limited regime fast. That is the mechanical reason your p95/p99 step time moves under load, even with the same checkpoint and temperature. Business impact, stated plainly This is why drift shows up where it hurts: p95 and p99. At idle, L2 residency is generous, fragmentation is lower, translation pressure is calmer, and step time is stable. Under load, residency collapses, translation tax rises, allocator feasibility tightens, step time stretches, and your control plane adapts by changing batching and shapes. That can move you into different execution plans without any model change. An enterprise buyer does not care whether you call it “L2 miss driven plan churn.” They care that two identical requests disagree and you cannot explain it. So the takeaway I want principals to internalize is simple: In continuous batching, memory hierarchy state is control-plane state. It shapes latency. Latency shapes batching. Batching shapes shapes. Shapes shape feasibility. Feasibility shapes the executed plan. That is how “performance” becomes “behavior.” Multi node tensor parallel, the execution plan extends across the fabric Once you go multi-node tensor parallel, you add a second execution plane that most teams underestimate. You are no longer operating only a GPU runtime. You are operating a distributed timeline. And the timeline is not a background detail. In continuous batching, the timeline becomes a control input that reshapes batching, shapes, and eventually the executed plan. Let me be precise about what I am claiming, and what I am not. I am not going to claim collectives reorder arithmetic inside a kernel. That would be sloppy. The correct claim is this: Distributed synchronization changes the timeline. The timeline changes admission and batching. Batching changes shapes. Shapes change which plans are legal. That’s enough to explain why the “same prompt, same checkpoint, temp=0” can drift only under real load. The minimal equation you should carry At each decode step, your latency is no longer “GPU time.” It’s GPU time plus fabric time: t_step ≈ t_compute + t_comm + t_sync And the part that hurts is that t_comm and t_sync are not stable. They are affected by contention, queueing, stragglers, and topology. A useful mental model for the communication piece is the classic latency–bandwidth form: t_comm(message) ≈ α + (n / β_eff) α is the per-collective startup and synchronization overhead n is bytes moved β_eff is the effective bandwidth you actually get under contention In isolation, this looks like performance math. In a continuous batching server, this becomes behavior math, because t_step feeds back into the next scheduling decision. What actually happens in multi-node TP at token cadence Tensor parallelism shards the model across devices. Every token step requires cross-device coordination for some portion of the layer execution. In practice, this means collectives become part of the critical path. NCCL’s collective ops are explicit about the semantics: for example, AllReduce reduces values across ranks and returns identical results to all ranks. That tells you what the runtime must do: it must wait for coordination across ranks before progressing. So the decode loop becomes: execute local compute for this step hit a collective boundary wait for the slowest rank to finish and for the fabric to deliver proceed That “slowest rank” detail is the piece people feel but rarely name. In distributed inference, p99 is often a straggler story. A single congested link, a slightly delayed rank, or a transient fabric stall turns into a global stall because collectives synchronize progress. In other words, a multi-node TP system behaves like a coupled oscillator: the fastest GPU is still gated by the slowest collective. Why this changes the executed plan, not just the latency Here’s the bridge to the thesis of the whole article. In a continuous batching server, you do not just execute requests. You continuously reform microbatches at token cadence. That means step time affects who joins the next step. And in multi-node TP, fabric jitter is one of the biggest sources of step-time variability. So when comm jitter shifts t_step, it shifts the schedule: queue delay changes microbatch membership changes effective shape class changes workspace feasibility changes tactic choice changes You already established earlier that a changed shape class can force a different tactic set. Multi-node TP adds a new reason shape churn happens: not only GPU pressure, but fabric timing pressure. So the claim stays clean and defensible: Distributed synchronization doesn’t need to change arithmetic to change behavior. It only needs to change the timeline that drives batching. Chip-to-fabric reality: why infrastructure details belong in the reproducibility record At this scale, the infrastructure is part of the runtime. According to Azure Docs, Azure’s ND H100 v5 series is explicitly positioned for tightly coupled scale-up and scale-out Generative AI and HPC workloads, and it’s built around the idea that the fabric matters, not just the GPUs: If you are running multi-node TP in production, treat fabric telemetry as part of your reproducibility record. Not because it is fun. Because it changes the system timeline that drives batching. A practical minimum is to track per-step: collective type on the critical path (e.g., all-reduce / all-gather) comm time and jitter (p50/p95/p99 per step window) rank skew (max(rank_time) − min(rank_time)) effective bandwidth estimate (n / t_comm) retransmit / congestion signals if your stack exposes them a “fabric regime” marker: normal vs congested vs degraded When drift becomes expensive This is one of the reasons enterprise teams report the most confusing failures only at load. At idle, your timeline is stable, your microbatches are stable, your shapes are stable, and your plan selection is stable. Under real load, the fabric introduces jitter, jitter reshapes batching, batching reshapes shapes, and shapes reshape the executed plan. Now two replicas can disagree, not because the model changed, but because the timeline differed. That shows up as: inconsistent answers across replicas in high-stakes workflows reproducibility failures during audits and incident reviews “regressions” after scaling out, even with the same checkpoint and code support costs and credibility loss because you cannot explain why behavior changed only at p95/p99 So the operational sentence I want you to carry into your postmortems is: In multi-node tensor parallel inference, the execution plan extends across the fabric. If you do not log the fabric timeline, you are missing part of the runtime state that decides which plan was feasible. Where Infrastructure Stops Being “Just Infrastructure” Once you accept the thesis of this article, one conclusion becomes unavoidable: cloud choices are not just cost and convenience decisions. They shape which execution regimes your runtime will enter under pressure. At scale, you are no longer buying “GPUs.” You are buying: A fabric and topology that holds up under synchronized token-step collectives A VM family with predictable characteristics for tightly coupled scale-out workloads (the kind multi-node inference actually is) An isolation posture that can be enforced in hardware when your threat model requires it, without hand-waving away the runtime implications First-class observability for GPU behavior, not just CPU and request traces, so you can correlate drift with the state variables that caused it (for example, exporting NVIDIA DCGM metrics into managed Prometheus and Azure Managed Grafana on AKS). This is the quiet reason certain platforms feel “more stable” in production. Not because the model is different, but because the runtime state is easier to constrain, measure, and explain when the underlying infrastructure is designed for the exact regime you’re operating in. Quantization effects on execution paths and causal stragglers in multi-node TP Let me be direct about what most articles miss when they discuss distributed inference at scale. The conversation typically stops at "how many GPUs" and "what's the bandwidth." That's not wrong. It's just incomplete. What's missing is the interaction between quantization-induced plan churn and straggler amplification in the collective path, two forces that quietly reshape your execution regime under VRAM pressure and fabric contention. These are not theoretical curiosities. They are production realities at 100+ GPU scale, the kind of scale where you can no longer afford to treat quantization as a "precision choice" or stragglers as a "latency outlier." At that scale, they become causal inputs to your runtime's decision surface. Quantization variability: not just precision, but plan selection When teams talk about INT8 or FP8 quantization, the conversation usually centers on memory savings and throughput gains. That's the marketing layer. The execution layer is more nuanced: quantization changes which kernels are legal, where fusion boundaries land, and how reduction trees are staged. Here's what I mean in concrete terms. Under VRAM pressure, your serving stack may need to requantize activations mid-forward-pass to stay within memory bounds. That requant step is not "free" in the plan sense. It introduces: dequant/requant cycles that break fusion opportunities you had in the FP16 path new non-associative operations in the reduction tree, where rounding happens at different stages fallback paths when the quantized kernel variant lacks workspace or doesn't support the current shape class Let me state this in the language of the article's thesis: quantization is not a data type. It is a tactic constraint that reshapes the feasible plan space. Memory pressure can force dequant/requant cycles, change fusion boundaries, and trigger fallback kernels with different reduction staging, producing last-bit differences that can flip tokens during decoding. The practical consequence? Two replicas running "the same quantized model" can execute different kernel variants when one is memory-pressured and the other is not. The memory-pressured replica may be forced into a fallback path with different reduction staging. Different staging means different rounding order. Different rounding order means different last bits. And in decoding, last bits can become different tokens. I've watched incident reviews where teams assumed INT8 was "deterministic" because they set the quantization scheme once at export time. What they missed is that the runtime's quantization pathway depends on the state of VRAM fragmentation, workspace availability, and kernel preference histograms, exactly the regime-dependent variables we've been building toward throughout this article. If you're operating at scale, instrument this. Track: per-step kernel selection via cuBLASLt preference descriptors dequant/requant cycle counts when memory pressure rises fallback events when preferred quantized tactics become infeasible whether the executed plan matched the "expected" quantization pathway This is rare telemetry. Most teams never see it because they're not running large enough clusters under sustained pressure. But once you cross into 100+ GPU inference workloads, quantization-induced plan churn becomes visible in your p99 drift signatures. Causal stragglers: when one rank's fallback stalls the collective Now let's talk about the fabric-scale pathology that couples with everything we just discussed: head-of-line blocking in distributed tensor parallelism. You already know from the multi-node TP section that collectives synchronize progress. The fastest rank waits for the slowest. That's the contract. What's less documented—and what I've only seen formalized in internal NVIDIA serving postmortem templates—is how a single rank's kernel fallback can become a collective-wide straggler, and how that straggler amplifies through the batching feedback loop. Here's the causal chain: One rank enters memory pressure. Maybe fragmentation is worse on that device, maybe it's handling a slightly different KV layout due to request assignment. That rank falls back to a slower tactic. The preferred kernel requires workspace. Workspace isn't available. The engine selects a legal fallback. The fallback kernel takes longer. Not by seconds—by milliseconds. But in a collective, milliseconds matter. The collective waits. AllReduce can't proceed until all ranks contribute. The straggler becomes the bottleneck. Step time stretches. The stretched step reshapes the next batch in continuous batching. Different batch, different shapes, different feasibility. The cycle repeats. Now multiple ranks may be in fallback paths. The p99 drift you're seeing isn't random—it's a feedback loop. This is what I call a causal straggler: not just a slow rank, but a rank whose performance degradation causally reshapes the execution regime of the entire TP group. And here's where quantization and stragglers intersect. If one rank is under more VRAM pressure and is forced into more frequent dequant/requant cycles, it becomes the straggler. Its quantization pathway differs from the other ranks—not because the model changed, but because the memory regime changed. That difference in pathway becomes a difference in step time. That difference in step time becomes a collective stall. That stall becomes a batching change. That batching change becomes a new plan. The output drifts, and you're left wondering why "the same checkpoint at temperature zero" produced different text only under load. The answer is: you weren't in the same execution regime. You were in a regime where one rank's memory pressure caused a straggler, the straggler caused a timeline shift, and the timeline shift caused a plan change. Rarity value: why this knowledge is elite production battle scars Let me be honest about why these gaps are rare. Most teams never operate at the scale where these effects dominate. If you're running inference on 8 GPUs, you might see hints of this. At 100+ GPUs with multi-node TP and continuous batching under sustained load, it's no longer a hint—it's the signature. The teams that do operate at this scale track: cuBLASLt preference histograms to detect when algorithm selection is churning across steps NCCL timeline traces to identify straggler signatures and correlate them with per-rank memory state per-rank kernel fallback events to see when one device is operating a different plan than its peers quantization pathway divergence across ranks under pressure This is the telemetry that doesn't show up in tutorials. It shows up in postmortems at hyperscaler SLO thresholds, where p99 latency violations trigger incident reviews and someone finally asks: "Why did replica 3 disagree with replica 1 only during the peak load window?" The article you're reading now covers single-node memory regimes beautifully. What bridges to 10/10 elite production knowledge is this: fabric-scale causality. The understanding that in multi-node TP, your execution regime is not just shaped by your GPU's memory state—it's shaped by the worst GPU's memory state, because collectives couple everyone's timeline. That's the gap. That's the rarity value. And if you're building or operating inference at 100+ GPU scale, that's the layer where your next outage is hiding. Peak depth: wavefront divergence, tensor core fragmentation, NCCL backpressure, and ISR collision Everything above operates at the principal and staff engineer level. What follows is the layer below that—the chip architect handoff, where you stop talking about "plans" in the abstract and start talking about warp stall cycles, tensor core fragment occupancy, NCCL retransmit chains, and memory evaporation under replication pressure. I'm writing this section because it's the part I never see published outside internal design reviews, and because these are the exact pathologies that turn a well-architected inference cluster into a system that disagrees with itself only during peak traffic. "Most engineers debug the layer they understand. The system breaks at the layer they don't. In production inference, that layer is almost always the one where microarchitecture meets scheduling meets the fabric." — Hazem Ali Wavefront divergence in decode attention kernels Let me take you inside the warp. In SIMT execution, a warp is 32 threads executing in lockstep. When all threads follow the same control path, you get full utilization. When they diverge—different threads take different branches—the warp must serialize both paths. That's textbook GPU architecture. What's not textbook is how this interacts with paged KV attention in production decode loops. In a paged KV system (the exact kind vLLM introduced), KV blocks are scattered across VRAM. Different sequences in the same microbatch may have their KV blocks in different residency states: some hot in L2, some cold in HBM, some partially evicted under paging pressure. When the attention kernel issues loads for KV blocks, threads within the same warp can stall at different rates depending on which blocks they're accessing and where those blocks reside. This creates a subtle but measurable pathology: Lane divergence inside the attention kernel. Not control-flow divergence in the traditional sense, but memory-latency divergence: some lanes return fast (L2 hit), some stall (HBM fetch), and the warp can't retire until the slowest lane completes. Register pressure amplification. When warps stall, the SM must keep their register state live. Under heavy stalling, register pressure rises, which can force the compiler to spill to local memory (which lives in L2/HBM). Spills create more memory traffic, which creates more stalls. It's a feedback loop at the microarchitectural level. Measurable p99 step variance in identical-shape batches. This is the part that confuses teams. Two consecutive decode steps with the same batch size and the same sequence lengths can have different step times, because the KV block residency pattern differed. The shape was identical. The memory topology was not. If you want to see this in practice, the tool is Nsight Systems. What you're looking for: # Nsight Systems trace analysis: partition warp stall cycles # Look for these stall reasons in the GPU metrics view: # - smsp__warps_issue_stalled_long_scoreboard → memory dependency stalls # - smsp__warps_issue_stalled_short_scoreboard → register dependency stalls # - smsp__warps_issue_stalled_no_instruction → instruction cache miss # # Correlate with: # - l1tex__t_sectors_pipe_lsu_mem_global_op_ld → global load sectors (KV fetches) # - lts__t_sectors_srcunit_tex_op_read_hit_rate → L2 hit rate during attention # # The diagnostic signal: when stall_long_scoreboard spikes correlate with # L2 hit rate drops, you're seeing KV residency divergence across warps. The stall partition tells you why the warp stalled. When you see long_scoreboard stalls dominating during attention kernels—and you see them correlating with L2 miss rate fluctuations—you're observing exactly the KV residency divergence I'm describing. The warp is waiting for scattered KV blocks, and the scatter pattern changes with every batch because paging decisions are state-dependent. This is how "identical shapes" produce different timelines. The shape is the same. The KV block map is not. And the block map is a function of runtime allocation history—the same state-dependent variable that drives everything else in this article. Tensor core fragment utilization collapse under shape churn Now let's go inside the tensor cores themselves. H100 and Blackwell tensor cores operate on matrix fragments—fixed-size tiles that map directly to the hardware's matrix multiply-accumulate units. On H100, the native fragment sizes for FP16 are typically 16×16×16 (m×n×k). When your operand dimensions align cleanly with fragment boundaries, you get full utilization. When they don't, you get fragment waste: the hardware still executes full fragments, but some of the lanes carry padding zeros. In continuous batching, shape churn is the norm. Your microbatch dimensions change at token cadence. And this is where a subtle but devastating efficiency collapse hides. Consider two microbatches that arrive one step apart: # Step t: B=16, L=2048 → GEMM shape aligns cleanly with 16×16 fragments # Fragment utilization: ~98% # cuBLASLt selects: WMMA-based kernel (tensor core native) # # Step t+1: B=17, L=2047 → GEMM shape straddles fragment boundaries # Fragment utilization: drops below 25% on trailing tiles # cuBLASLt selects: fallback to non-WMMA FP16 kernel # (or WMMA with heavy padding, depending on heuristic) The difference is one sequence in the batch and one token in context length. The performance consequence is that the runtime switches from tensor core native execution to a scalar FP16 path. That's not a minor variant. That's a fundamentally different instruction mix, a different reduction tree, and a different accumulation order. The ulp deltas that result from this switch don't stay contained in the GEMM output. They propagate forward through layer normalization—which is itself a reduction over the hidden dimension. Layer norm amplifies small differences because it divides by a variance term computed from the same values. A tiny shift in the GEMM output becomes a slightly different variance, which becomes a slightly different normalization, which becomes a slightly different input to the next layer's attention. You can observe this directly via cuBLASLt's algorithm preference reporting: # cuBLASLt algorithm preference histogram (conceptual) # Track per-step which algorithm ID was selected for the primary GEMM # # Healthy (stable shapes): # algo_id=42 (WMMA_TENSOR_OP_HMMA_16816) → 99.2% of steps # algo_id=17 (SIMT_FP16_SPLITK) → 0.8% of steps # # Under shape churn (continuous batching, mixed lengths): # algo_id=42 (WMMA_TENSOR_OP_HMMA_16816) → 61.3% of steps # algo_id=17 (SIMT_FP16_SPLITK) → 22.1% of steps # algo_id=31 (WMMA_TENSOR_OP_PAD16) → 16.6% of steps # # When algo_id distribution churns, your reduction tree is churning. # When your reduction tree churns, your last bits are churning. # When your last bits churn under thin margins, your tokens can flip. That histogram is the smoking gun. When you see algorithm preference distribution widening under load, you're watching the tensor cores get destabilized by shape churn. The fix isn't "use bigger batches." The fix is to understand that continuous batching creates a shape distribution, not a fixed shape, and that shape distribution maps directly to a tactic distribution, which maps directly to a ulp distribution. NCCL causal backpressure chains across TP+DP pods Now scale this to the fabric. Take an 8×TP + 4×DP pod: 32 GPUs total, where every token step requires AllReduce across the 8-way TP group, and gradient synchronization (or KV redistribution in some architectures) across the 4-way DP group. Here's the causal backpressure chain I've traced in production, laid out as a timeline: Rank 5 (of 8 TP ranks) hits a quant/dequant stall. Its KV blocks are fragmented, workspace is tight, and the runtime forces a dequant cycle mid-attention. That adds ~1.2ms to this rank's compute. AllReduce stalls on Rank 5. The other 7 ranks complete their portion and issue their NCCL send. Rank 5 hasn't arrived yet. NCCL's ring/tree protocol can't progress past this rank. Effective t_sync inflates by 2× compared to the no-straggler baseline. P2P retransmit triggers. Under some fabric topologies and congestion states, the delayed arrival from Rank 5 can cause NCCL to hit internal retry logic on the NVLink or InfiniBand path. This is not a "network error"—it's the transport protocol managing flow control under backpressure. But it adds latency jitter that is invisible unless you're tracing at the NCCL bootstrap level. vLLM scheduler reacts to the stretched step. The scheduler sees that step t took 2× longer than expected. Under its latency-aware admission control, it drops batch size from 32 → 12 to protect SLO. Smaller batch means different shapes. Different shapes mean different tactics. The plan changes. The batch size drop propagates. With batch size at 12, queued requests wait longer. Queue pressure builds. When the scheduler recovers and re-admits, the burst creates shape churn. Shape churn destabilizes tensor core fragment utilization. The system is now in a different execution regime—triggered by one rank's memory fragmentation. That is a causal backpressure chain. Not a latency spike. Not a network blip. A causally connected sequence where a microarchitectural event on one device reshapes the execution plan across the entire pod. To trace this, you need NCCL bootstrap traces with NVTX domain annotations: # NCCL tracing with NVTX domains for causal analysis # # Environment setup for trace collection: # NCCL_DEBUG=INFO # NCCL_DEBUG_SUBSYS=INIT,COLL,P2P # NSYS_NVTX_DOMAINS=nccl,cuda,cublas # # In Nsight Systems, correlate: # 1. Per-rank kernel duration (cuda domain) — identify the straggler # 2. NCCL collective start/end (nccl domain) — measure t_sync inflation # 3. P2P transport events (nccl/P2P) — detect retransmit/backpressure # 4. Scheduler batch decisions (application NVTX) — see batch size reaction # # The causal signal: when rank N's kernel duration spike aligns with # NCCL collective inflation across all ranks, followed by batch size # reduction in the scheduler, you have a causal backpressure chain. # # Regex for filtering straggler events in nsys export: # grep -E "ncclAllReduce.*duration_us > (2 * median_duration)" trace.sqlite # → correlate timestamp with scheduler batch_size change events This is the telemetry that separates "we think there was network jitter" from "Rank 5's dequant stall caused a 2× collective inflation that forced the scheduler to halve batch size, which shifted the shape class into a non-WMMA tactic for the next 47 steps." The first is a guess. The second is a causal explanation. And in an incident review at scale, only the second one survives. ISR + checkpoint overlap pathology: memory evaporation under replication pressure This is the deepest pathology in this article, and it almost never surfaces below 512 sequences per second. Large-scale inference deployments use incremental state replication (ISR) for fault tolerance: rather than checkpointing the entire model state, you replicate KV cache deltas and scheduler state to a standby node incrementally, so failover is fast. Separately, many systems run async checkpointing for recovery: periodic snapshots of model and optimizer state written to persistent storage, overlapped with inference to avoid blocking the decode loop. Under normal load, these two systems coexist peacefully. ISR replicates small deltas. Checkpointing writes in the background. Memory headroom is sufficient for both. Under paging pressure—the exact regime we've been discussing throughout this article—they collide. Here's the pathological interaction: The system is under VRAM pressure. KV blocks are being paged (allocated, evicted, re-allocated) at high frequency. Memory headroom is thin. ISR kicks in. It needs to replicate recent KV deltas to the standby. To do this, it must pin certain KV blocks in memory while it serializes and transmits them. Async checkpointing overlaps. The checkpoint writer is also holding references to memory regions it's snapshotting. Under normal conditions, this is fine—there's enough headroom. Under paging pressure, the checkpoint's memory holds compete with ISR's memory holds. Memory evaporation. The combined pinning from ISR + checkpointing temporarily removes KV blocks from the pool available to the decode loop. The pager sees available blocks drop. It may be forced to evict active KV blocks—blocks that are needed for in-flight sequences—to make room. Evicted blocks must be recomputed. When a sequence's KV is evicted mid-collective (during an AllReduce, for example), the rank that lost its KV must recompute it. That recompute makes this rank the straggler. And we already know what stragglers do to the collective timeline. The straggler triggers the full backpressure chain. Collective stall → batch size reduction → shape churn → tactic churn → output drift. All caused by a fault-tolerance mechanism designed to keep you safe. ISR pins KV deltas for replication while async checkpointing pins regions for snapshotting. Under paging pressure, the combined pinning shrinks the decode-available KV pool, forces evictions and recompute, creates stragglers, and cascades into collective stalls → batch reduction → shape/tactic churn → p99 output drift. I call this memory evaporation because from the decode loop's perspective, VRAM that was available simply vanishes for a window of time. The blocks are still physically present—they're held by ISR and the checkpointer, but they're not available to the runtime. The effect is identical to a sudden drop in free VRAM, and the runtime reacts accordingly: it enters a pressured regime. This is why the pathology rarely surfaces below 512 seq/s. At lower throughput, there's enough headroom that ISR and checkpointing never compete meaningfully with the decode loop's memory needs. At high throughput under sustained load, the margins collapse, and the three systems—decode, ISR, checkpoint—start fighting over the same memory. The fix is not "turn off ISR." The fix is to coordinate memory budgets across these three subsystems and to treat ISR and checkpointing as memory consumers that participate in the regime calculation. If your regime function doesn't account for replication and checkpoint holds, it's underestimating pressure, and your system will surprise you at exactly the scale where fault tolerance matters most. # extended regime function accounting for replication and checkpoint pressure def regime_extended(vram_free_mb, paging_on, isolation_strict, queue_p95_ms, isr_pinned_mb, ckpt_pinned_mb, kv_pool_total_mb): effective_free = vram_free_mb - isr_pinned_mb - ckpt_pinned_mb effective_ratio = effective_free / kv_pool_total_mb if kv_pool_total_mb > 0 else 1.0 if isolation_strict: return "isolation_strict" if effective_ratio < 0.05: return "memory_evaporation" # ISR+ckpt collision if paging_on: return "paging" if effective_free < 1024: return "memory_pressured" if queue_p95_ms > 50: return "queue_degraded" return "normal" That "memory_evaporation" regime is the one you never see at idle. It only appears when throughput is high enough that ISR frequency, checkpoint frequency, and decode memory demand all peak simultaneously. And when it appears, it doesn't show up as an OOM. It shows up as a straggler, which shows up as a collective stall, which shows up as a batch size drop, which shows up as a shape change, which shows up as output drift at p99. That's the full causal chain from fault tolerance to token flip. The chip-architect handoff These four pathologies, wavefront divergence, tensor core fragmentation, NCCL backpressure, and ISR collision are what elevate from principal-level operational insight to chip-architect-level systems thinking. They share a common structure: A microarchitectural or infrastructure event occurs that is invisible at the API layer. The event changes the timeline or the memory topology, not the "inputs." The changed timeline or topology feeds back into scheduling, shaping, or tactic selection. The feedback loop produces a different executed plan. The different plan produces a different result that is correct by contract but different by observation. If you're instrumenting at this depth, you're not debugging anymore. You're operating a system where the observability itself is part of the architecture. And if you're carrying the thesis of this article to its logical conclusion: the executed plan is not just a function of the GPU state. It's a function of the warp state, the fragment state, the fabric state, and the replication state—all coupled through continuous batching at token cadence. Security is not a layer, it changes execution Now let’s go deep, because this is where a lot of principal level reviews go wrong. Teams talk about security as confidentiality and correctness as something separate. In multi tenant inference, they couple. IOMMU based GPU isolation and DMA remapping Microsoft documents IOMMU based GPU isolation as a technique to manage how GPUs access system memory, improving security and stability: Microsoft also documents IOMMU DMA remapping, describing how GPUs access memory through logical addresses that are no longer mapped one to one, enabling logically contiguous address ranges through translation: This matters for two reasons. First, it is a real hardware enforced boundary, not a policy checkbox. Second, boundaries introduce overhead and constraints. Constraints change what is allowed. Allowed execution choices shape the plan space. Confidential computing on H100 NVIDIA states that H100 is the first GPU to introduce support for confidential computing and that it can be used in virtualized environments with VMs or Kubernetes based deployments. Azure has also published general availability of confidential VMs with H100, which is the practical deployment side of this posture: Now the key architectural point. When you turn on stronger isolation, you often restrict sharing. You restrict cross tenant microbatching. You add attestation requirements. You change how memory is mapped and protected. That can reduce throughput. Reduced throughput moves you closer to regime boundaries. When the system crosses a regime boundary, the executed plan changes. Security posture becomes an SLO dimension. If you do not test it, you do not know what system you are running. GPU cache side channels, why sharing is not a theoretical risk There is published research that treats GPU caches as a leakage surface. The USENIX Security 2024 paper Invalidate plus Compare presents a timer free GPU cache attack primitive. I will not provide attack recipes. You do not need them to understand the conclusion. If your threat model includes untrusted co tenants, shared microarchitectural resources matter. If you respond by increasing isolation, your execution constraints change. That changes performance and can change the execution regimes your serving stack enters. Security and runtime behavior are coupled. State collapse, the phase transition that looks like model instability If you don’t know what state collapse is, imagine a highway that looks perfectly calm at 2 a.m. Every lane is open. Every car keeps its distance. Your ETA is stable. You run the same route ten times and you get the same arrival time. Then 8:30 a.m. hits. Nothing “broke” in the highway. The asphalt is the same. The speed limit is the same. The cars are the same. But the system crosses a density threshold. One small brake tap becomes a shockwave. Lanes start interacting. Merges become bottlenecks. A single slow truck creates a queue that ripples backwards. Suddenly your ETA isn’t a property of your car anymore. It’s a property of the traffic regime. That is state collapse in production inference. At low load, the system behaves stable. At high load, output drift appears. And teams mislabel it as “model instability,” or “LLM randomness,” or “temperature drift.” Most of the time, it is none of that. It is a phase transition in the runtime. You didn’t change weights. You crossed a regime boundary. What collapses, exactly State collapse is not “everything gets slower.” It is when the control plane loses the degrees of freedom it was using to keep execution consistent. Under low load, the runtime has slack: enough VRAM headroom to keep preferred tactics feasible enough cache residency to keep step times predictable enough scheduling flexibility to keep microbatch composition stable enough workspace contiguity to avoid algorithm fallbacks enough fabric stability (in multi-node TP) to keep step cadence tight Under high load, that slack disappears. The runtime stops being a “fast executor” and becomes a “survival scheduler.” And once it crosses that boundary, it starts making different decisions that are all valid, all correct by contract, and all capable of shifting outputs. This is why it feels like instability: the model hasn’t changed, but the executed plan has. Why this shows up as output drift, not just latency drift Because decoding is a branching process. A small numerical difference that does nothing in a benchmark can flip a token if the margin is thin. One flip changes the context. The context changes the next logits. Now you’re on a different path. So the runtime doesn’t need to be “wrong” to produce different text. It just needs to execute a different legal plan under a different legal regime. That is the whole thesis of this article, condensed into one sentence: Weights are static. Behavior is a property of the executed plan. The executed plan is a function of state. The common triggers that push systems into collapse You can treat these as the usual “threshold crossings” that shrink the feasible plan space: Memory headroom shrinks → feasible tactic set shrinks Preferred kernels often require workspace. When headroom or contiguity drops, tactics become illegal and the engine selects other tactics. Cache residency collapses → stalls rise → step timing drifts L2 hit rate drops, HBM traffic rises, and decode steps stretch. In continuous batching, stretched steps reshape the next batch. Continuous batching shifts the mix and shapes Under load, microbatch membership changes at token cadence. Shape class changes are not cosmetic; they change kernel feasibility. Framework and engine algorithm selection changes depending on settings Autotuning, benchmarking, and backend heuristics mean the “same op” can legally choose different algorithms. Under pressure, the best choice can become infeasible. CUDA execution permits ordering freedom and floating point order sensitivity remains true Parallel staging and legal reordering can shift last bits. Under thin margins, last bits can become different tokens. Nothing here requires a bug. This is what “execution under constraint” looks like. The incident question that stops the hand-waving If you want a more honest incident question, use this: Which execution regime ran, and what constraints pushed us into it? Not “was the prompt the same.” Not “were the weights the same.” Not “did we set temperature to zero.” Regime first. Because state collapse is not a mystery. It’s a threshold. And once you learn to name the threshold, you can instrument it, test it, and stop being surprised by it at p95 and p99. A reproducibility protocol that works for principals, not demos Logging prompts is not reproducibility. It is wishful thinking. If you want to be able to defend behavior, you need to reconstruct the execution state. Log the execution contract Per request, log: effective input length after shaping truncation boundary and reason decode configuration actually applied admission time, queue time, GPU time per step batch fingerprint or at minimum batch identity and shape class memory headroom watermark and whether you were in a pressured allocation regime engine precision mode settings and any fallback relevant flags cuDNN benchmark and deterministic settings if relevant isolation posture, including whether cross tenant batching is permitted Track margins early Track top two logit margins for early steps. Use it as a stability budget. If the margin collapses under a certain prompt family, treat that as a risk surface. Not every prompt is equally stable. Test under regimes, not at idle Do not run determinism tests at idle and call it solved. Test under: sustained concurrency mixed sequence lengths continuous batching realistic memory pressure real isolation posture If you do not do this, you are validating a different system than the one you ship. vLLM’s paper exists precisely because these conditions define the serving problem. Closing If you want production LLM behavior to be explainable, stop treating the model as the whole system. Weights are static. Executed math is selected under constraint. Behavior lives in the gap. You did not deploy weights. You deployed a physics constrained runtime that contains weights. And that runtime is allowed to change the executed plan, because floating point order matters, CUDA scheduling freedom is part of the contract, engines can choose precision pathways, and serving stacks intentionally reshape batching and memory. Acknowledgments While this article dives into the hidden memory mechanics that shape LLM behavior under load, I’m grateful it was peer-reviewed and challenged before publishing. A special thanks for Hammad Atta and Abhilekh Verma for peer-reviewing this piece and challenging it from a security-and-systems angle. If this article resonated, it’s likely because it describes a reality many teams encounter only after an incident: production LLM behavior is a property of the executed plan, and the executed plan is a function of state. If you’re running production inference at scale and observing behavior shifts under load—especially in tail-latency regimes, I’m happy to connect on LinkedIn. I’m open to substantive technical discussion. Thank you for reading. I hope this helps you surface the hidden variables in serving and turn them into telemetry, controls, and repeatable postmortem evidence. And if you’re seeing similar regime transitions or plan churn in your own deployments, I’d be interested to hear how it presents in your stack. — Hazem Ali Microsoft AI MVP, Distinguished AI & ML Engineer / Architect268Views0likes0CommentsCreating a Fun Multi-Agent Content Strategy System with Microsoft Agent Framework
This tutorial walks you through building a multi-agent content strategy system using Microsoft's AutoGen framework. Three specialised AI agents — a Content Creator, an Algorithm Simulator, and an Audience Persona — collaborate to help gaming content creators pressure-test their social media posts before publishing. Using live Google Trends data and platform-specific scoring rubrics for TikTok, Twitter/X, YouTube, and Instagram, the system generates content, predicts how each platform's algorithm would distribute it, and simulates authentic audience reactions. The tutorial covers core multi-agent patterns including role specialisation, structured evaluation, iterative feedback loops, and resilient tool integration — all running on GitHub Models' free tier.435Views0likes0CommentsAdvanced Function Calling and Multi-Agent Systems with Small Language Models in Foundry Local
Advanced Function Calling and Multi-Agent Systems with Small Language Models in Foundry Local In our previous exploration of function calling with Small Language Models, we demonstrated how to enable local SLMs to interact with external tools using a text-parsing approach with regex patterns. While that method worked, it required manual extraction of function calls from the model's output; functional but fragile. Today, I'm excited to show you something far more powerful: Foundry Local now supports native OpenAI-compatible function calling with select models. This update transforms how we build agentic AI systems locally, making it remarkably straightforward to create sophisticated multi-agent architectures that rival cloud-based solutions. What once required careful prompt engineering and brittle parsing now works seamlessly through standardized API calls. We'll build a complete multi-agent quiz application that demonstrates both the elegance of modern function calling and the power of coordinated agent systems. The full source code is available in this GitHub repository, but rather than walking through every line of code, we'll focus on how the pieces work together and what you'll see when you run it. What's New: Native Function Calling in Foundry Local As we explored in our guide to running Phi-4 locally with Foundry Local, we ran powerful language models on our local machine. The latest version now support native function calling for models specifically trained with this capability. The key difference is architectural. In our weather assistant example, we manually parsed JSON strings from the model's text output using regex patterns and frankly speaking, meticulously testing and tweaking the system prompt for the umpteenth time 🙄. Now, when you provide tool definitions to supported models, they return structured tool_calls objects that you can directly execute. Currently, this native function calling capability is available for the Qwen 2.5 family of models in Foundry Local. For this tutorial, we're using the 7B variant, which strikes a great balance between capability and resource requirements. Quick Setup Getting started requires just a few steps. First, ensure you have Foundry Local installed. On Windows, use winget install Microsoft.FoundryLocal , and on macOS, use bash brew install microsoft/foundrylocal/foundrylocal You'll need version 0.8.117 or later. Install the Python dependencies in the requirements file, then start your model. The first run will download approximately 4GB: foundry model run qwen2.5-7b-instruct-cuda-gpu If you don't have a compatible GPU, use the CPU version instead, or you can specify any other Qwen 2.5 variant that suits your hardware. I have set a DEFAULT_MODEL_ALIAS variable you can modify to use different models in utils/foundry_client.py file. Keep this terminal window open. The model needs to stay running while you develop and test your application. Understanding the Architecture Before we dive into running the application, let's understand what we're building. Our quiz system follows a multi-agent architecture where specialized agents handle distinct responsibilities, coordinated by a central orchestrator. The flow works like this: when you ask the system to generate a quiz about photosynthesis, the orchestrator agent receives your message, understands your intent, and decides which tool to invoke. It doesn't try to generate the quiz itself, instead, it calls a tool that creates a specialist QuizGeneratorAgent focused solely on producing well-structured quiz questions. Then there's another agent, reviewAgent, that reviews the quiz with you. The project structure reflects this architecture: quiz_app/ ├── agents/ # Base agent + specialist agents ├── tools/ # Tool functions the orchestrator can call ├── utils/ # Foundry client connection ├── data/ ├── quizzes/ # Generated quiz JSON files │── responses/ # User response JSON files └── main.py # Application entry point The orchestrator coordinates three main tools: generate_new_quiz, launch_quiz_interface, and review_quiz_interface. Each tool either creates a specialist agent or launches an interactive interface (Gradio), handling the complexity so the orchestrator can focus on routing and coordination. How Native Function Calling Works When you initialize the orchestrator agent in main.py, you provide two things: tool schemas that describe your functions to the model, and a mapping of function names to actual Python functions. The schemas follow the OpenAI function calling specification, describing each tool's purpose, parameters, and when it should be used. Here's what happens when you send a message to the orchestrator: The agent calls the model with your message and the tool schemas. If the model determines a tool is needed, it returns a structured tool_calls attribute containing the function name and arguments as a proper object—not as text to be parsed. Your code executes the tool, creates a message with "role": "tool" containing the result, and sends everything back to the model. The model can then either call another tool or provide its final response. The critical insight is that the model itself controls this flow through a while loop in the base agent. Each iteration represents the model examining the current state, deciding whether it needs more information, and either proceeding with another tool call or providing its final answer. You're not manually orchestrating when tools get called; the model makes those decisions based on the conversation context. Seeing It In Action Let's walk through a complete session to see how these pieces work together. When you run python main.py, you'll see the application connect to Foundry Local and display a welcome banner: Now type a request like "Generate a 5 question quiz about photosynthesis." Watch what happens in your console: The orchestrator recognized your intent, selected the generate_new_quiz tool, and extracted the topic and number of questions from your natural language request. Behind the scenes, this tool instantiated a QuizGeneratorAgent with a focused system prompt designed specifically for creating quiz JSON. The agent used a low temperature setting to ensure consistent formatting and generated questions that were saved to the data/quizzes folder. This demonstrates the first layer of the multi-agent architecture: the orchestrator doesn't generate quizzes itself. It recognizes that this task requires specialized knowledge about quiz structure and delegates to an agent built specifically for that purpose. Now request to take the quiz by typing "Take the quiz." The orchestrator calls a different tool and Gradio server is launched. Click the link to open in a browser window displaying your quiz questions. This tool demonstrates how function calling can trigger complex interactions—it reads the quiz JSON, dynamically builds a user interface with radio buttons for each question, and handles the submission flow. After you answer the questions and click submit, the interface saves your responses to the data/responses folder and closes the Gradio server. The orchestrator reports completion: The system now has two JSON files: one containing the quiz questions with correct answers, and another containing your responses. This separation of concerns is important—the quiz generation phase doesn't need to know about response collection, and the response collection doesn't need to know how quizzes are created. Each component has a single, well-defined responsibility. Now request a review. The orchestrator calls the third tool: A new chat interface opens, and here's where the multi-agent architecture really shines. The ReviewAgent is instantiated with full context about both the quiz questions and your answers. Its system prompt includes a formatted view of each question, the correct answer, your answer, and whether you got it right. This means when the interface opens, you immediately see personalized feedback: The Multi-Agent Pattern Multi-agent architectures solve complex problems by coordinating specialized agents rather than building monolithic systems. This pattern is particularly powerful for local SLMs. A coordinator agent routes tasks to specialists, each optimized for narrow domains with focused system prompts and specific temperature settings. You can use a 1.7B model for structured data generation, a 7B model for conversations, and a 4B model for reasoning, all orchestrated by a lightweight coordinator. This is more efficient than requiring one massive model for everything. Foundry Local's native function calling makes this straightforward. The coordinator reliably invokes tools that instantiate specialists, with structured responses flowing back through proper tool messages. The model manages the coordination loop—deciding when it needs another specialist, when it has enough information, and when to provide a final answer. In our quiz application, the orchestrator routes user requests but never tries to be an expert in quiz generation, interface design, or tutoring. The QuizGeneratorAgent focuses solely on creating well-structured quiz JSON using constrained prompts and low temperature. The ReviewAgent handles open-ended educational dialogue with embedded quiz context and higher temperature for natural conversation. The tools abstract away file management, interface launching, and agent instantiation, the orchestrator just knows "this tool launches quizzes" without needing implementation details. This pattern scales effortlessly. If you wanted to add a new capability like study guides or flashcards, you could just easily create a new tool or specialists. The orchestrator gains these capabilities automatically by having the tool schemas you have defined without modifying core logic. This same pattern powers production systems with dozens of specialists handling retrieval, reasoning, execution, and monitoring, each excelling in its domain while the coordinator ensures seamless collaboration. Why This Matters The transition from text-parsing to native function calling enables a fundamentally different approach to building AI applications. With text parsing, you're constantly fighting against the unpredictability of natural language output. A model might decide to explain why it's calling a function before outputting the JSON, or it might format the JSON slightly differently than your regex expects, or it might wrap it in markdown code fences. Native function calling eliminates this entire class of problems. The model is trained to output tool calls as structured data, separate from its conversational responses. The multi-agent aspect builds on this foundation. Because function calling is reliable, you can confidently delegate to specialist agents knowing they'll integrate smoothly with the orchestrator. You can chain tool calls—the orchestrator might generate a quiz, then immediately launch the interface to take it, based on a single user request like "Create and give me a quiz about machine learning." The model handles this orchestration intelligently because the tool results flow back as structured data it can reason about. Running everything locally through Foundry Local adds another dimension of value and I am genuinely excited about this (hopefully, the phi models get this functionality soon). You can experiment freely, iterate quickly, and deploy solutions that run entirely on your infrastructure. For educational applications like our quiz system, this means students can interact with the AI tutor as much as they need without cost concerns. Getting Started With Your Own Multi-Agent System The complete code for this quiz application is available in the GitHub repository, and I encourage you to clone it and experiment. Try modifying the tool schemas to see how the orchestrator's behavior changes. Add a new specialist agent for a different task. Adjust the system prompts to see how agent personalities and capabilities shift. Think about the problems you're trying to solve. Could they benefit from having different specialists handling different aspects? A customer service system might have agents for order lookup, refund processing, and product recommendations. A research assistant might have agents for web search, document summarization, and citation formatting. A coding assistant might have agents for code generation, testing, and documentation. Start small, perhaps with two or three specialist agents for a specific domain. Watch how the orchestrator learns to route between them based on the tool descriptions you provide. You'll quickly see opportunities to add more specialists, refine the existing ones, and build increasingly sophisticated systems that leverage the unique strengths of each agent while presenting a unified, intelligent interface to your users. In the next entry, we will be deploying our quizz app which will mark the end of our journey in Foundry and SLMs these past few weeks. I hope you are as excited as I am! Thanks for reading.383Views0likes0CommentsFunction Calling with Small Language Models
In our previous article on running Phi-4 locally, we built a web-enhanced assistant that could search the internet and provide informed answers. Here's what that implementation looked like: def web_enhanced_query(question): # 1. ALWAYS search (hardcoded decision) search_results = search_web(question) # 2. Inject results into prompt prompt = f"""Here are recent search results: {search_results} Question: {question} Using only the information above, give a clear answer.""" # 3. Model just summarizes what it reads return ask_phi4(endpoint, model_id, prompt) Today, we're upgrading to true function calling. With this, we have ability to transform small language models from passive text generators into intelligent agents that can: Decide when to use external tools Reason which tool bests fit each task Execute real-world actions thrugh apis Function calling represents a significant evolution in AI capabilities. Let's understand where this positions our small language models: Agent Classification Framework Simple Reflex Agents (Basic) React to immediate input with predefined rules Example: Thermostat, basic chatbot Without function calling, models operate here Model-Based Agents (Intermediate) Maintain internal state and context Example: Robot vacuum with room mapping Function calling enables this level Goal-Based Agents (Advanced) Plan multi-step sequences to achieve objectives Example: Route planner, task scheduler Function calling + reasoning enables this Learning Agents (Expert) Adapt and improve over time Example: Recommendation systems Future: Function calling + fine-tuning As usual with these articles, let's get ready to get our hands dirty! Project Setup Let's set up our environment for building function-calling assistants. Prerequisites First, ensure you have Foundry Local installed and a model running. We'll use Qwen 2.5-7B for this tutorial as it has excellent function calling support. Important: Not all small language models support function calling equally. Qwen 2.5 was specifically trained for this capability and provides a reliable experience through Foundry Local. # 1. Check Foundry Local is installed foundry --version # 2. Start the Foundry Local service foundry service start # 3. Download and run Qwen 2.5-7B foundry model run qwen2.5-7b Python Environment Setup # 1. Create Python virtual environment python -m venv venv source venv/bin/activate # Windows: venv\Scripts\activate # 2. Install dependencies pip install openai requests python-dotenv # 3. Get a free OpenWeatherMap API key # Sign up at: https://openweathermap.org/api ``` Create `.env` file: ``` OPENWEATHER_API_KEY=your_api_key_here ``` Building a Weather-Aware Assistant So in this scenario, a user wants to plan outdoor activities but needs weather context. Without function calling, You will get something like this: User: "Should I schedule my team lunch outside at 2pm in Birmingham?" Model: "That depends on weather conditions. Please check the forecast for rain and temperature." However, with fucntion-calling you get an answer that is able to look up the weather and reply with the needed context. We will do that now. Understanding Foundry Local's Function Calling Implementation Before we start coding, there's an important implementation detail to understand. Foundry Local uses a non-standard function calling format. Instead of returning function calls in the standard OpenAI tool_calls field, Qwen models return the function call as JSON text in the response content. For example, when you ask about weather, instead of: # Standard OpenAI format message.tool_calls = [ {"name": "get_weather", "arguments": {"location": "Birmingham"}} ] You get: # Foundry Local format message.content = '{"name": "get_weather", "arguments": {"location": "Birmingham"}}' This means we need to parse the JSON from the content ourselves. Don't worry—this is straightforward, and I'll show you exactly how to handle it! Step 1: Define the Weather Tool Create weather_assistant.py: import os from openai import OpenAI import requests import json import re from dotenv import load_dotenv load_dotenv() # Initialize Foundry Local client client = OpenAI( base_url="http://127.0.0.1:59752/v1/", api_key="not-needed" ) # Define weather tool tools = [ { "type": "function", "function": { "name": "get_weather", "description": "Get current weather information for a location", "parameters": { "type": "object", "properties": { "location": { "type": "string", "description": "The city or location name" }, "units": { "type": "string", "description": "Temperature units", "enum": ["celsius", "fahrenheit"], "default": "celsius" } }, "required": ["location"] } } } ] A tool is necessary because it provides the model with a structured specification of what external functions are available and how to use them. The tool definition contains the function name, description, parameters schema, and information returned. Step 2: Implement the Weather Function def get_weather(location: str, units: str = "celsius") -> dict: """Fetch weather data from OpenWeatherMap API""" api_key = os.getenv("OPENWEATHER_API_KEY") url = "http://api.openweathermap.org/data/2.5/weather" params = { "q": location, "appid": api_key, "units": "metric" if units == "celsius" else "imperial" } response = requests.get(url, params=params, timeout=5) response.raise_for_status() data = response.json() temp_unit = "°C" if units == "celsius" else "°F" return { "location": data["name"], "temperature": f"{round(data['main']['temp'])}{temp_unit}", "feels_like": f"{round(data['main']['feels_like'])}{temp_unit}", "conditions": data["weather"][0]["description"], "humidity": f"{data['main']['humidity']}%", "wind_speed": f"{round(data['wind']['speed'] * 3.6)} km/h" } The model calls this function to get the weather data. it contacts OpenWeatherMap API, gets real weather data and returns it as a python dictionary Step 3: Parse Function Calls from Content This is the crucial step where we handle Foundry Local's non-standard format: def parse_function_call(content: str): """Extract function call JSON from model response""" if not content: return None json_pattern = r'\{"name":\s*"get_weather",\s*"arguments":\s*\{[^}]+\}\}' match = re.search(json_pattern, content) if match: try: return json.loads(match.group()) except json.JSONDecodeError: pass try: parsed = json.loads(content.strip()) if isinstance(parsed, dict) and "name" in parsed: return parsed except json.JSONDecodeError: pass return None Step 4: Main Chat Function with Function Calling and lastly, calling the model. Notice the tools and tool_choice parameter. Tools tells the model it is allowed to output a tool_call requesting that the function be executed. While tool_choice instructs the model how to decide whether to call a tool. def chat(user_message: str) -> str: """Process user message with function calling support""" messages = [ {"role": "user", "content": user_message} ] response = client.chat.completions.create( model="qwen2.5-7b-instruct-generic-cpu:4", messages=messages, tools=tools, tool_choice="auto", temperature=0.3, max_tokens=500 ) message = response.choices[0].message if message.content: function_call = parse_function_call(message.content) if function_call and function_call.get("name") == "get_weather": print(f"\n[Function Call] {function_call.get('name')}({function_call.get('arguments')})") args = function_call.get("arguments", {}) weather_data = get_weather(**args) print(f"[Result] {weather_data}\n") final_prompt = f"""User asked: "{user_message}" Weather data: {json.dumps(weather_data, indent=2)} Provide a natural response based on this weather information.""" final_response = client.chat.completions.create( model="qwen2.5-7b-instruct-generic-cpu:4", messages=[{"role": "user", "content": final_prompt}], max_tokens=200, temperature=0.7 ) return final_response.choices[0].message.content return message.content Step 5: Run the script Now put all the above together and run the script def main(): """Interactive weather assistant""" print("\nWeather Assistant") print("=" * 50) print("Ask about weather or general questions.") print("Type 'exit' to quit\n") while True: user_input = input("You: ").strip() if user_input.lower() in ['exit', 'quit']: print("\nGoodbye!") break if user_input: response = chat(user_input) print(f"Assistant: {response}\n") if __name__ == "__main__": if not os.getenv("OPENWEATHER_API_KEY"): print("Error: OPENWEATHER_API_KEY not set") print("Set it with: export OPENWEATHER_API_KEY='your_key_here'") exit(1) main() Note: Make sure Qwen 2.5 is running in Foundry Local in a new terminal Now let's talk about Model Context Protocol! Our weather assistant works beautifully with a single function, but what happens when you need dozens of tools? Database queries, file operations, calendar integration, email—each would require similar setup code. This is where Model Context Protocol (MCP) comes in. MCP is an open standard that provides pre-built, standardized servers for common tools. Instead of writing custom integration code for every capability, you can connect to MCP servers that handle the complexity for you. With MCP, You only need one command to enable weather, database, and file access npx @modelcontextprotocol/server-weather npx @modelcontextprotocol/server-sqlite npx @modelcontextprotocol/server-filesystem Your model automatically discovers and uses these tools without custom integration code. Learn more: Model Context Protocol Documentation EdgeAI Course - Module 03: MCP Integration Key Takeaways Function calling transforms models into agents - From passive text generators to active problem-solvers Qwen 2.5 has excellent function calling support - Specifically trained for reliable tool use Foundry Local uses non-standard format - Parse JSON from content instead of tool_calls field Start simple, then scale with MCP - Build one tool to understand the pattern, then leverage standards Documentation Running Phi-4 Locally with Foundry Local Phi-4: Small Language Models That Pack a Punch Microsoft Foundry Local GitHub EdgeAI for Beginners Course OpenWeatherMap API Documentation Model Context Protocol Qwen 2.5 Documentation Thank you for reading! I hope this article helps you build more capable AI agents with small language models. Function calling opens up incredible possibilities—from simple weather assistants to complex multi-tool workflows. Start with one tool, understand the pattern, and scale from there.703Views1like0CommentsRunning Phi-4 Locally with Microsoft Foundry Local: A Step-by-Step Guide
In our previous post, we explored how Phi-4 represents a new frontier in AI efficiency that delivers performance comparable to models 5x its size while being small enough to run on your laptop. Today, we're taking the next step: getting Phi-4 up and running locally on your machine using Microsoft Foundry Local. Whether you're a developer building AI-powered applications, an educator exploring AI capabilities, or simply curious about running state-of-the-art models without relying on cloud APIs, this guide will walk you through the entire process. Microsoft Foundry Local brings the power of Azure AI Foundry to your local device without requiring an Azure subscription, making local AI development more accessible than ever. So why do you want to run Phi-4 Locally? Before we dive into the setup, let's quickly recap why running models locally matters: Privacy and Control: Your data never leaves your machine. This is crucial for sensitive applications in healthcare, finance, or education where data privacy is paramount. Cost Efficiency: No API costs, no rate limits. Once you have the model downloaded, inference is completely free. Speed and Reliability: No network latency or dependency on external services. Your AI applications work even when you're offline. Learning and Experimentation: Full control over model parameters, prompts, and fine-tuning opportunities without restrictions. With Phi-4's compact size, these benefits are now accessible to anyone with a modern laptop—no expensive GPU required. What You'll Need Before we begin, make sure you have: Operating System: Windows 10/11, macOS (Intel or Apple Silicon), or Linux RAM: Minimum 16GB (32GB recommended for optimal performance) Storage: At least 5 - 10GB of free disk space Processor: Any modern CPU (GPU optional but provides faster inference) Note: Phi-4 works remarkably well even on consumer hardware 😀. Step 1: Installing Microsoft Foundry Local Microsoft Foundry Local is designed to make running AI models locally as simple as possible. It handles model downloads, manages memory efficiently, provides OpenAI-compatible APIs, and automatically optimizes for your hardware. For Windows Users: Open PowerShell or Command Prompt and run: winget install Microsoft.FoundryLocal For macOS Users (Apple Silicon): Open Terminal and run: brew install microsoft/foundrylocal/foundrylocal Verify Installation: Open your terminal and type. This should return the Microsoft Foundry Local version, confirming installation: foundry --version Step 2: Downloading Phi-4-Mini For this tutorial, we'll use Phi-4-mini, the lightweight 3.8 billion parameter version that's perfect for learning and experimentation. Open your terminal and run: foundry model run phi-4-mini You should see your download begin and something similar to the image below Available Phi Models on Foundry Local While we're using phi-4-mini for this guide, Foundry Local offers several Phi model variants and other open-source models optimized for different hardware and use cases: Model Hardware Type Size Best For phi-4-mini GPU chat-completion 3.72 GB Learning, fast responses, resource-constrained environments with GPU phi-4-mini CPU chat-completion 4.80 GB Learning, fast responses, CPU-only systems phi-4-mini-reasoning GPU chat-completion 3.15 GB Reasoning tasks with GPU acceleration phi-4-mini-reasoning CPU chat-completion 4.52 GB Mathematical proofs, logic puzzles with lower resource requirements phi-4 GPU chat-completion 8.37 GB Maximum reasoning performance, complex tasks with GPU phi-4 CPU chat-completion 10.16 GB Maximum reasoning performance, CPU-only systems phi-3.5-mini GPU chat-completion 2.16 GB Most lightweight option with GPU support phi-3.5-mini CPU chat-completion 2.53 GB Most lightweight option, CPU-optimized phi-3-mini-128k GPU chat-completion 2.13 GB Extended context (128k tokens), GPU-optimized phi-3-mini-128k CPU chat-completion 2.54 GB Extended context (128k tokens), CPU-optimized phi-3-mini-4k GPU chat-completion 2.13 GB Standard context (4k tokens), GPU-optimized phi-3-mini-4k CPU chat-completion 2.53 GB Standard context (4k tokens), CPU-optimized Note: Foundry Local automatically selects the best variant for your hardware. If you have an NVIDIA GPU, it will use the GPU-optimized version. Otherwise, it will use the CPU-optimized version. run the command below to see full list of models foundry model list Step 3: Test It Out Once the download completes, an interactive session will begin. Let's test Phi-4-mini's capabilities with a few different prompts: Example 1: Explanation Phi-4-mini provides a thorough, well-structured explanation! It starts with the basic definition, explains the process in biological systems, gives real-world examples (plant cells, human blood cells). The response is detailed yet accessible. Example 2: Mathematical Problem Solving Excellent step-by-step solution! Phi-4-mini breaks down the problem methodically: 1. Distributes on the left side 2. Isolates the variable terms 3. Simplifies progressively 4. Arrives at the final answer: x = 11 The model shows its work clearly, making it easy to follow the logic and ideal for educational purposes Example 3: Code Generation The model provides a concise Python function using string slicing ([::-1]) - the most Pythonic approach to reversing a string. It includes clear documentation with a docstring explaining the function's purpose, provides example usage demonstrating the output, and even explains how the slicing notation works under the hood. The response shows that the model understands not just how to write the code, but why this approach is preferred - noting that the [::-1] slice notation means "start at the end of the string and end at position 0, move with the step -1, negative one, which means one step backwards." This showcases the model's ability to generate production-ready code with proper documentation while being educational about Python idioms. To exit the interactive session, type `/bye` Step 4: Extending Phi-4 with Real-Time Tools Understanding Phi-4's Knowledge Cutoff Like all language models, Phi-4 has a knowledge cutoff date from its training data (typically several months old). This means it won't know about very recent events, current prices, or breaking news. For example, if you ask "Who won the 2024 NBA championship?" it might not have the answer. The good thing is, there's a powerful work-around. While Phi-4 is incredibly capable, connecting it to external tools like web search, databases, or APIs transforms it from a static knowledge base into a dynamic reasoning engine. This is where Microsoft Foundry's REST API comes in. Microsoft Foundry provides a simple API that lets you integrate Phi-4 into Python applications and connect it to real-time data sources. Here's a practical example: building a web-enhanced AI assistant. Web-Enhanced AI Assistant This simple application combines Phi-4's reasoning with real-time web search, allowing it to answer current questions accurately. Prerequisites: pip install foundry-local-sdk requests ddgs Create phi4_web_assistant.py: import requests from foundry_local import FoundryLocalManager from ddgs import DDGS import json def search_web(query): """Search the web and return top results""" try: results = list(DDGS().text(query, max_results=3)) if not results: return "No search results found." search_summary = "\n\n".join([ f"[Source {i+1}] {r['title']}\n{r['body'][:500]}" for i, r in enumerate(results) ]) return search_summary except Exception as e: return f"Search failed: {e}" def ask_phi4(endpoint, model_id, prompt): """Send a prompt to Phi-4 and stream response""" response = requests.post( f"{endpoint}/chat/completions", json={ "model": model_id, "messages": [{"role": "user", "content": prompt}], "stream": True }, stream=True, timeout=180 ) full_response = "" for line in response.iter_lines(): if line: line_text = line.decode('utf-8') if line_text.startswith('data: '): line_text = line_text[6:] # Remove 'data: ' prefix if line_text.strip() == '[DONE]': break try: data = json.loads(line_text) if 'choices' in data and len(data['choices']) > 0: delta = data['choices'][0].get('delta', {}) if 'content' in delta: chunk = delta['content'] print(chunk, end="", flush=True) full_response += chunk except json.JSONDecodeError: continue print() return full_response def web_enhanced_query(question): """Combine web search with Phi-4 reasoning""" # By using an alias, the most suitable model will be downloaded # to your device automatically alias = "phi-4-mini" # Create a FoundryLocalManager instance. This will start the Foundry # Local service if it is not already running and load the specified model. manager = FoundryLocalManager(alias) model_info = manager.get_model_info(alias) print("🔍 Searching the web...\n") search_results = search_web(question) prompt = f"""Here are recent search results: {search_results} Question: {question} Using only the information above, give a clear answer with specific details.""" print("🤖 Phi-4 Answer:\n") return ask_phi4(manager.endpoint, model_info.id, prompt) if __name__ == "__main__": # Try different questions question = "Who won the 2024 NBA championship?" # question = "What is the latest iPhone model released in 2024?" # question = "What is the current price of Bitcoin?" print(f"Question: {question}\n") print("=" * 60 + "\n") web_enhanced_query(question) print("\n" + "=" * 60) Run It: python phi4_web_assistant.py What Makes This Powerful By connecting Phi-4 to external tools, you create an intelligent system that: Accesses Real-Time Information: Get news, weather, sports scores, and breaking developments Verifies Facts: Cross-reference information with multiple sources Extends Capabilities: Connect to databases, APIs, file systems, or any other tool Enables Complex Applications: Build research assistants, customer support bots, educational tutors, and personal assistants This same pattern can be applied to connect Phi-4 to: Databases: Query your company's internal data APIs: Weather services, stock prices, translation services File Systems: Analyze documents and spreadsheets IoT Devices: Control smart home systems The possibilities are endless when you combine local AI reasoning with real-world data access. Troubleshooting Common Issues Service not running: Make sure Foundry Local is properly installed and the service is running. Try restarting with foundry --version to verify installation. Model downloads slowly: Check your internet connection and ensure you have enough disk space (5-10GB per model). Out of memory: Close other applications or try using a smaller model variant like phi-3.5-mini instead of the full phi-4. Connection issues: Verify that no other services are using the same ports. Foundry Local typically runs on http://localhost:5272. Model not found: Run foundry model list to see available models, then use foundry model run <model-name> to download and run a specific model. Your Next Steps with Foundry Local Congratulations! You now have Phi-4 running locally through Microsoft Foundry Local and understand how to extend it with external tools like web search. This combination of local AI reasoning with real-time data access opens up countless possibilities for building intelligent applications. Coming in Future Posts In the coming weeks, we'll explore advanced topics using Hugging Face: Fine-tuning Phi models on your own data for domain-specific applications Phi-4-multimodal: Analyze images, process audio, and combine multiple data types Advanced deployment patterns: RAG systems and multi-agent orchestration Resources to Explore EdgeAI for Beginners Course: Comprehensive 36-45 hour course covering Edge AI fundamentals, optimization, and production deployment Phi-4 Technical Report: Deep dive into architecture and benchmarks Phi Cookbook on GitHub: Practical examples and recipes Foundry Local Documentation: Complete technical documentation and API reference Module 08: Foundry Local Toolkit: 10 comprehensive samples including RAG applications and multi-agent systems Keep experimenting with Foundry Local, and stay tuned as we unlock the full potential of Edge AI! What will you build with Phi-4? Share your ideas and projects in the comments below!2.6KViews1like1CommentAI Sparks: Unleashing Agents with the AI Toolkit
The final episode of our "AI Sparks" series delved deep into the exciting world of AI Agents and their practical implementation. We also covered a fair part of MCP with Microsoft AI Toolkit extension for VS Code. We kicked off by charting the evolutionary path of intelligent conversational systems. Starting with the rudimentary rule-based Basic Chatbots, we then explored the advancements brought by Basic Generative AI Chatbots, which offered contextually aware interactions. Then we explored the Retrieval-Augmented Generation (RAG), highlighting its ability to ground generative models in specific knowledge bases, significantly enhancing accuracy and relevance. The limitations were also discussed for the above mentioned techniques. The session was then centralized to the theme – Agents and Agentic Frameworks. We uncovered the fundamental shift from basic chatbots to autonomous agents capable of planning, decision-making, and executing tasks. We moved forward with detailed discussion on the distinction between Single Agentic systems, where one core agent orchestrates the process, and Multi-Agent Architectures, where multiple specialized agents collaborate to achieve complex goals. A key part of building robust and reliable AI Agents, as we discussed, revolves around carefully considering four critical factors. Firstly, Knowledge-Providing agents with the right context is paramount for them to operate effectively and make informed decisions. Secondly, equipping agents with the necessary Actions by granting them access to the appropriate tools allows them to execute tasks and achieve desired outcomes. Thirdly, Security is non-negotiable; ensuring agents have access only to the data and services they genuinely need is crucial for maintaining privacy and preventing unintended actions. Finally, establishing robust Evaluations mechanisms is essential to verify that agents are completing tasks correctly and meeting the required standards. These four pillars – Knowledge, Actions, Security, and Evaluation – form the bedrock of any successful agentic implementation. To illustrate the transformative power of AI Agents, we explored several interesting use cases and applications. These ranged from intelligent personal assistants capable of managing schedules and automating workflows to sophisticated problem-solving systems in domains like customer service. A significant portion of the session was dedicated to practical implementation through demonstrations. We highlighted key frameworks that are empowering developers to build agentic systems.: Semantic Kernel: We highlighted its modularity and rich set of features for integrating various AI services and tools. Autogen Studio: The focus here was on its capabilities for facilitating the creation and management of multi-agent conversations and workflows. Agent Service: We discussed its role in providing a more streamlined and managed environment for deploying and scaling AI agents. The major point of attraction was that these were demonstrated using the local LLMs which were hosted using AI Toolkit. This showcased the ease with which developers can utilize VS Code AI toolkit to build and experiment with agentic workflows directly within their familiar development environment. Finally, we demystified the concept of Model Context Protocol (MCP) and demonstrated how seamlessly it can be implemented using the Agent Builder within the VS Code AI Toolkit. We demonstrated this with a basic Website development using MCP. This practical demonstration underscored the toolkit's power in simplifying the development of complex solutions that can maintain context and engage in more natural, multi-step interactions. The "AI Sparks" series concluded with a discussion, where attendees had a clearer understanding of the evolution, potential and practicalities of AI Agents. The session underscored that we are on the cusp of a new era of intelligent systems that are not just reactive but actively work alongside us to achieve goals. The tools and frameworks are maturing, and the possibilities for agentic applications are sparking innovation across various industries. It was an exciting journey, and engagement during the final session on AI Sparks around Agents truly highlighted the transformative potential of this field. "AI Sparks" Series Roadmap: The "AI Sparks" series delved deeper into specific topics using AI Toolkit for Visual Studio Code, including: Introduction to AI toolkit and feature walkthrough: Introduction to the AI Toolkit extension for VS Code a powerful way to explore and integrate the latest AI models from OpenAI, Meta, Deepseek, Mistral, and more. Introduction to SLMs and local model with use cases: Explore Small Language Models (SLMs) and how they compare to larger models. Building RAG Applications: Create powerful applications that combine the strengths of LLMs with external knowledge sources. Multimodal Support and Image Analysis: Working with vision models and building multimodal applications. Evaluation and Model Selection: Evaluate model performance and choose the best model for your needs. Agents and Agentic Frameworks: Exploring the cutting edge of AI agents and how they can be used to build more complex and autonomous systems. The full playlist of the series with all the episodes of "AI Sparks" is available at AI Sparks Playlist. Continue the discussion and questions in Microsoft AI Discord Community where we have a dedicated AI-sparks channel. All the code samples can be found on AI_Toolkit_Samples .We look forward to continuing these insightful discussions in future series!379Views2likes0CommentsHow to use any Python AI agent framework with free GitHub Models
I ❤️ when companies offer free tiers for developer services, since it gives everyone a way to learn new technologies without breaking the bank. Free tiers are especially important for students and people between jobs, when the desire to learn is high but the available cash is low. That's why I'm such a fan of GitHub Models: free, high-quality generative AI models available to anyone with a GitHub account. The available models include the latest OpenAI LLMs (like o3-mini), LLMs from the research community (like Phi and Llama), LLMs from other popular providers (like Mistral and Jamba), multimodal models (like gpt-4o and llama-vision-instruct) and even a few embedding models (from OpenAI and Cohere). With access to such a range of models, you can prototype complex multi-model workflows to improve your productivity or heck, just make something fun for yourself. 🤗 To use GitHub Models, you can start off in no-code mode: open the playground for a model, send a few requests, tweak the parameters, and check out the answers. When you're ready to write code, select "Use this model". A screen will pop up where you can select a programming language (Python/JavaScript/C#/Java/REST) and select an SDK (which varies depending on model). Then you'll get instructions and code for that model, language, and SDK. But here's what's really cool about GitHub Models: you can use them with all the popular Python AI frameworks, even if the framework has no specific integration with GitHub Models. How is that possible? The vast majority of Python AI frameworks support the OpenAI Chat Completions API, since that API became a defacto standard supported by many LLM API providers besides OpenAI itself. GitHub Models also provide OpenAI-compatible endpoints for chat completion models. Therefore, any Python AI framework that supports OpenAI-like models can be used with GitHub Models as well. 🎉 To prove it, I've made a new repository with examples from eight different Python AI agent packages, all working with GitHub Models: python-ai-agent-frameworks-demos. There are examples for AutoGen, LangGraph, Llamaindex, OpenAI Agents SDK, OpenAI standard SDK, PydanticAI, Semantic Kernel, and SmolAgents. You can open that repository in GitHub Codespaces, install the packages, and get the examples running immediately. Now let's walk through the API connection code for GitHub Models for each framework. Even if I missed your favorite framework, I hope my tips here will help you connect any framework to GitHub Models. OpenAI I'll start with openai , the package that started it all! import openai client = openai.OpenAI( api_key=os.environ["GITHUB_TOKEN"], base_url="https://models.inference.ai.azure.com") The code above demonstrates the two key parameters we'll need to configure for all frameworks: api_key : When using OpenAI.com, you pass your OpenAI API key here. When using GitHub Models, you pass in a Personal Access Token (PAT). If you open the repository (or any repository) in GitHub Codespaces, a PAT is already stored in the GITHUB_TOKEN environment variable. However, if you're working locally with GitHub Models, you'll need to generate a PAT yourself and store it. PATs expire after a while, so you need to generate new PATs every so often. base_url : This parameter tells the OpenAI client to send all requests to "https://models.inference.ai.azure.com" instead of the OpenAI.com API servers. That's the domain that hosts the OpenAI-compatible endpoint for GitHub Models, so you'll always pass that domain as the base URL. If we're working with the new openai-agents SDK, we use very similar code, but we must use the AsyncOpenAI client from openai instead. Lately, Python AI packages are defaulting to async, because it's so much better for performance. import agents import openai client = openai.AsyncOpenAI( base_url="https://models.inference.ai.azure.com", api_key=os.environ["GITHUB_TOKEN"]) model = agents.OpenAIChatCompletionsModel( model="gpt-4o", openai_client=client) spanish_agent = agents.Agent( name="Spanish agent", instructions="You only speak Spanish.", model=model) PydanticAI Now let's look at all of the packages that make it really easy for us, by allowing us to directly bring in an instance of either OpenAI or AsyncOpenAI . For PydanticAI, we configure an AsyncOpenAI client, then construct an OpenAIModel object from PydanticAI, and pass that model to the agent: import openai import pydantic_ai import pydantic_ai.models.openai client = openai.AsyncOpenAI( api_key=os.environ["GITHUB_TOKEN"], base_url="https://models.inference.ai.azure.com") model = pydantic_ai.models.openai.OpenAIModel( "gpt-4o", provider=OpenAIProvider(openai_client=client)) spanish_agent = pydantic_ai.Agent( model, system_prompt="You only speak Spanish.") Semantic Kernel For Semantic Kernel, the code is very similar. We configure an AsyncOpenAI client, then construct an OpenAIChatCompletion object from Semantic Kernel, and add that object to the kernel. import openai import semantic_kernel.connectors.ai.open_ai import semantic_kernel.agents chat_client = openai.AsyncOpenAI( api_key=os.environ["GITHUB_TOKEN"], base_url="https://models.inference.ai.azure.com") chat = semantic_kernel.connectors.ai.open_ai.OpenAIChatCompletion( ai_model_id="gpt-4o", async_client=chat_client) kernel.add_service(chat) spanish_agent = semantic_kernel.agents.ChatCompletionAgent( kernel=kernel, name="Spanish agent" instructions="You only speak Spanish") AutoGen Next, we'll check out a few frameworks that have their own wrapper of the OpenAI clients, so we won't be using any classes from openai directly. For AutoGen, we configure both the OpenAI parameters and the model name in the same object, then pass that to each agent: import autogen_ext.models.openai import autogen_agentchat.agents client = autogen_ext.models.openai.OpenAIChatCompletionClient( model="gpt-4o", api_key=os.environ["GITHUB_TOKEN"], base_url="https://models.inference.ai.azure.com") spanish_agent = autogen_agentchat.agents.AssistantAgent( "spanish_agent", model_client=client, system_message="You only speak Spanish") LangGraph For LangGraph, we configure a very similar object, which even has the same parameter names: import langchain_openai import langgraph.graph model = langchain_openai.ChatOpenAI( model="gpt-4o", api_key=os.environ["GITHUB_TOKEN"], base_url="https://models.inference.ai.azure.com", ) def call_model(state): messages = state["messages"] response = model.invoke(messages) return {"messages": [response]} workflow = langgraph.graph.StateGraph(MessagesState) workflow.add_node("agent", call_model) SmolAgents Once again, for SmolAgents, we configure a similar object, though with slightly different parameter names: import smolagents model = smolagents.OpenAIServerModel( model_id="gpt-4o", api_key=os.environ["GITHUB_TOKEN"], api_base="https://models.inference.ai.azure.com") agent = smolagents.CodeAgent(model=model) Llamaindex I saved Llamaindex for last, as it is the most different. The llama-index package has a different constructor for OpenAI.com versus OpenAI-like servers, so I opted to use that OpenAILike constructor instead. However, I also needed an embeddings model for my example, and the package doesn't have an OpenAIEmbeddingsLike constructor, so I used the standard OpenAIEmbedding constructor. import llama_index.embeddings.openai import llama_index.llms.openai_like import llama_index.core.agent.workflow Settings.llm = llama_index.llms.openai_like.OpenAILike( model="gpt-4o", api_key=os.environ["GITHUB_TOKEN"], api_base="https://models.inference.ai.azure.com", is_chat_model=True) Settings.embed_model = llama_index.embeddings.openai.OpenAIEmbedding( model="text-embedding-3-small", api_key=os.environ["GITHUB_TOKEN"], api_base="https://models.inference.ai.azure.com") agent = llama_index.core.agent.workflow.ReActAgent( tools=query_engine_tools, llm=Settings.llm) Choose your models wisely! In all of the examples above, I specified the gpt-4o model. The gpt-4o model is a great choice for agents because it supports function calling, and many agent frameworks only work (or work best) with models that natively support function calling. Fortunately, GitHub Models includes multiple models that support function calling, at least in my basic experiments: gpt-4o gpt-4o-mini o3-mini AI21-Jamba-1.5-Large AI21-Jamba-1.5-Mini Codestral-2501 Cohere-command-r Ministral-3B Mistral-Large-2411 Mistral-Nemo Mistral-small You might find that some models work better than others, especially if you're using agents with multiple tools. With GitHub Models, it's very easy to experiment and see for yourself, by simply changing the model name and re-running the code. Join the AI Agents Hackathon We are currently running a free virtual hackathon from April 8th - 30th, to challenge developers to create agentic applications using Microsoft technologies. You could build an agent entirely using GitHub Models and submit it to the hackathon for a chance to win amazing prizes! You can also join our 30+ streams about building AI agents, including a stream all about prototyping with GitHub Models. Learn more and register at https://aka.ms/agentshack2.7KViews3likes0Comments