cache

2 Topics

KVStream: Smarter Memory Management for On-Device Language Model Inference
On-Device LLMs Are Starved for Memory Intelligence

The shift toward on-device language model inference is accelerating. Platforms like Microsoft's Foundry Local, Ollama, and llama.cpp now make it possible to run capable models such as Phi-3-mini and Llama 3 entirely on local hardware, with no cloud dependency and no data leaving the device. This is a meaningful leap for privacy-sensitive applications, offline scenarios, and latency-critical workloads. But on-device and edge inference surfaces a class of performance problems that cloud-based serving largely abstracts away.

Memory is the bottleneck. When a language model generates text, it builds a Key-Value (KV) cache, a memory structure that stores intermediate attention states so it doesn't have to recompute them for every new token. On a cloud server with hundreds of gigabytes of HBM (High Bandwidth Memory), wasting KV cache memory is a rounding error. On a developer workstation or a GPU/NPU-equipped edge device, it is the difference between a usable system and a stalled one.

The typical on-device runtime allocates KV cache memory statically and per sequence. This creates three concrete problems:

Fragmentation and over-reservation. A model has no way of knowing in advance how long a response will be, so runtimes reserve a worst-case buffer the size of the maximum context window for every request, even if the actual response is short. The result is a fragmented memory pool where most of the reserved space sits idle while queued requests wait for a slot to open. Research confirms the scale of this waste: production settings with variable-length or concurrent requests typically waste 60–80% of KV memory under monolithic static allocation; paged designs reduce this to under 5%.

No batching across requests. Most local runtimes process one request at a time. If you fire four concurrent questions at Foundry Local or Ollama without a batching layer, they queue sequentially.
Total completion time grows linearly with the number of requests instead of benefiting from the parallel processing the underlying hardware supports. Continuous batching, the technique of admitting new sequences mid-generation rather than waiting for the whole batch to drain, has been shown to yield up to 36.9× throughput improvement in the original Orca study, with production deployments regularly achieving 2–5× over static batching.

Redundant computation. Real-world applications consistently include a shared system prompt: instructions that tell the model how to behave. With a RAG pipeline or a chat assistant, the same few hundred tokens get prefilled on every single request, burning compute and inflating TTFT (time-to-first-token) unnecessarily. External prefix caching layers addressing this problem have demonstrated up to 15× throughput improvement on multi-round workloads.

These are not hypothetical inefficiencies. In practice, they translate to sluggish multi-turn conversation, poor throughput when multiple users or agent loops share the same local model, and wasted hardware potential on capable machines that could be doing much more. Crucially, while some cloud-scale inference engines embed these techniques internally, there is no equivalent orchestration layer for on-device runtimes, a gap that enterprise deployments and the research community are now actively calling out.

Introducing KVStream: A Middleware Layer for Local LLM Runtimes

KVStream is a lightweight Python middleware that sits between the application and any local LLM runtime, adding production-grade memory management and scheduling without requiring you to modify the backend or the client. The design principle is deliberate: KVStream is not a new inference engine. It does not replace Foundry Local or any other runtime.
Instead, it fills the orchestration gap that on-device runtimes currently leave unaddressed: the gap between a single model server and an application that expects the reliability and throughput of a managed serving system.

KVStream exposes a fully OpenAI-compatible API on http://localhost:8080/v1. Any existing client (the openai Python SDK, LangChain, httpx, or a curl command) connects to it without modification. The backend continues to run unchanged.

Architecture

KVStream is composed of four cooperating subsystems. Understanding how they interact clarifies why the gains are meaningful and composable.

1. Paged KV-Cache Allocator

Inspired by the paged attention mechanism, KVStream manages KV cache memory as a pool of fixed-size pages (or blocks), each holding a configurable number of token states (default: 16 tokens per block). Rather than reserving a contiguous worst-case buffer per sequence, the allocator assigns pages on demand and reclaims them when a sequence finishes.

Each sequence owns a logical page table, a mapping from logical page indices to physical block slots. Pages can be shared across sequences (enabling prefix deduplication) and migrated between GPU and CPU (enabling preemption).

For Foundry Local and Ollama, this operates in soft-inject mode: the page table controls admission and logical accounting, while the actual KV tensors stay inside the backend. For llama.cpp, which exposes a /slots API for saving and restoring raw KV state, KVStream can operate in hard-inject mode: it manages a real tensor pool and performs zero-recompute cache reuse by physically restoring KV state between requests.

2. Continuous Batching Scheduler

The scheduler merges multiple queued sequences into a single batched forward pass, up to a configurable max_batch_size. Critically, it uses continuous batching: new sequences can be admitted mid-generation, filling slots vacated by completed requests rather than waiting for the entire batch to finish.
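To make the difference between static and continuous batching concrete, here is a minimal simulation. It is a sketch under simplifying assumptions, not KVStream's scheduler: every decode step is assumed to cost the same regardless of batch size, requests are admitted in FIFO order, and the function names and the example output lengths are hypothetical.

```python
from collections import deque


def static_batching_steps(output_lens, max_batch_size):
    """Decode steps when each batch must fully drain before the next starts."""
    steps = 0
    for i in range(0, len(output_lens), max_batch_size):
        batch = output_lens[i:i + max_batch_size]
        steps += max(batch)  # the longest sequence holds every slot
    return steps


def continuous_batching_steps(output_lens, max_batch_size):
    """Decode steps when a vacated slot is refilled before the next step."""
    waiting = deque(output_lens)  # FIFO queue of tokens still to generate
    running = []                  # remaining tokens per active sequence
    steps = 0
    while waiting or running:
        # Admit queued sequences into any free slots.
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())
        # One batched forward pass emits one token per running sequence;
        # sequences that reach zero remaining tokens leave the batch.
        running = [r - 1 for r in running if r - 1 > 0]
        steps += 1
    return steps


# Hypothetical workload: two long answers mixed with six short ones.
lens = [64, 8, 8, 8, 64, 8, 8, 8]
print(static_batching_steps(lens, max_batch_size=4))      # 128
print(continuous_batching_steps(lens, max_batch_size=4))  # 72
```

In this toy workload the short requests no longer wait behind the long ones, so the same eight requests finish in 72 decode steps instead of 128. Real gains depend on how per-step cost grows with batch size on the actual hardware.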
Two scheduling priorities are supported:

- fcfs (first-come, first-served): straightforward FIFO, best for fairness.
- sjf (shortest-job-first): minimizes average TTFT by prioritizing requests with shorter expected output lengths.

When GPU page blocks are exhausted and a new sequence must be admitted, the scheduler applies a preemption policy:

- swap: the lowest-priority active sequence's pages are migrated to CPU RAM. The sequence resumes when GPU blocks become available again.
- recompute: pages are freed and the sequence is re-queued from scratch. This has lower memory overhead but higher latency for the preempted request.

3. Prefix Cache

The prefix cache deduplicates the KV computation for any token prefix shared across requests, such as system prompts, few-shot examples, RAG preambles, or any stable instruction block. The mechanism works in three steps:

(1) After a request completes its prefill phase, KVStream hashes the prompt tokens in block-aligned chunks and stores the canonical block table for that prefix.
(2) On a subsequent request whose prompt starts with the same prefix, the new sequence forks the canonical block table via copy-on-write, so no recomputation occurs.
(3) Entries expire after a configurable TTL (default: 1 hour), or can be evicted manually.

The practical consequence: in any application with a stable system prompt, only the first request pays the prefill cost for that prompt. Every subsequent request skips it entirely, reducing TTFT in proportion to the length of the shared prefix.

4. OpenAI-Compatible Proxy

KVStream wraps all of the above behind a standard /v1/chat/completions interface. It handles both streaming (SSE) and non-streaming responses, translates between the OpenAI request format and the backend's native API, and serves /health, /status, and /metrics endpoints for observability.
The /status endpoint returns live scheduler and memory state:

    {
      "scheduler": {
        "waiting": 2,
        "running": 8,
        "swapped": 0,
        "gpu_blocks_free": 184,
        "gpu_utilization": 0.281
      },
      "prefix_cache": {
        "cached_prefixes": 3,
        "total_prefix_hits": 12,
        "cached_tokens": 768
      }
    }

A Prometheus-compatible /metrics endpoint is also available, with a pre-configured Grafana dashboard included in the Docker Compose stack.

Getting Started with Foundry Local

Microsoft's Foundry Local is the primary integration target for KVStream. Foundry Local provides a high-quality on-device inference runtime with strong model support (Phi-3-mini, Phi-3.5, and others), NPU acceleration, and direct integration with the Windows AI platform. KVStream's Foundry backend adds continuous batching and prefix caching on top of that foundation without any changes to the Foundry runtime itself.

Installation:

    pip install kvstream

Start the KVStream Proxy:

    # 1. Start Foundry Local (if not already running)
    foundrylocal serve

    # 2. Start KVStream in front of it
    kvstream serve --backend foundry --model phi-3-mini --port 8080

KVStream's Foundry backend includes auto-discovery: if Foundry Local assigns an ephemeral OS port (which it typically does), KVStream scans localhost to locate the active service automatically, so you do not need to hardcode the backend port.

Connect your existing client, unchanged:

    from openai import OpenAI

    client = OpenAI(
        base_url="http://localhost:8080/v1",
        api_key="not-required",
    )

    response = client.chat.completions.create(
        model="phi-3-mini",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Explain paged attention in simple terms."},
        ],
        max_tokens=512,
    )
    print(response.choices[0].message.content)

Any client that speaks the OpenAI protocol, such as LangChain, httpx, or a raw HTTP call, connects here without modification.
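The block-aligned prefix matching described under Prefix Cache can be sketched in a few lines, and the sketch shows why a shared prefix only helps when it sits at the very start of the prompt. This is an illustration under assumptions, not KVStream's implementation: the chained SHA-256 hashing scheme, the PrefixIndex class, and the integer token IDs are all hypothetical; only the 16-token block size comes from the text above.

```python
import hashlib

BLOCK_SIZE = 16  # tokens per block (the default quoted in the article)


def block_hashes(token_ids):
    """Chain-hash each full block so a hash identifies the whole prefix up to it."""
    hashes = []
    prev = b""
    full = len(token_ids) - len(token_ids) % BLOCK_SIZE  # ignore a partial tail
    for start in range(0, full, BLOCK_SIZE):
        chunk = token_ids[start:start + BLOCK_SIZE]
        data = prev + b"|" + ",".join(map(str, chunk)).encode()
        prev = hashlib.sha256(data).digest()
        hashes.append(prev)
    return hashes


class PrefixIndex:
    """Hypothetical index mapping a chained block hash to a block id."""

    def __init__(self):
        self._blocks = {}

    def insert(self, token_ids):
        """Record every full block of a prompt after its prefill completes."""
        for h in block_hashes(token_ids):
            self._blocks.setdefault(h, len(self._blocks))

    def match(self, token_ids):
        """Return how many leading tokens are covered by cached blocks."""
        matched = 0
        for h in block_hashes(token_ids):
            if h not in self._blocks:
                break  # the first miss ends the reusable prefix
            matched += BLOCK_SIZE
        return matched


system = list(range(40))  # stand-in for a 40-token system prompt
index = PrefixIndex()
index.insert(system + list(range(100, 110)))        # first request

print(index.match(system + list(range(200, 210))))  # 32
print(index.match(list(range(200, 210)) + system))  # 0
```

With the system prompt first, two full blocks (32 of its 40 tokens) are reusable; the third block mixes system and user tokens, so it differs between requests. With the user text first, nothing matches, because every chained hash depends on all tokens before it.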
Maximize prefix cache hits: Put the system prompt first; KVStream caches it after the first request and skips its computation for every subsequent one.

    import asyncio
    from openai import AsyncOpenAI

    client = AsyncOpenAI(base_url="http://localhost:8080/v1", api_key="not-required")

    SYSTEM = "You are an expert assistant specialized in on-device AI."

    async def ask(question: str) -> str:
        r = await client.chat.completions.create(
            model="phi-3-mini",
            messages=[
                {"role": "system", "content": SYSTEM},  # cached after first call
                {"role": "user", "content": question},
            ],
            max_tokens=256,
        )
        return r.choices[0].message.content

    async def main():
        questions = [
            "What is NPU acceleration?",
            "How does Phi-3-mini differ from larger models?",
            "What is token streaming?",
        ]
        # KVStream batches these automatically; system prompt computed once
        answers = await asyncio.gather(*[ask(q) for q in questions])
        for q, a in zip(questions, answers):
            print(f"Q: {q}\nA: {a}\n")

    asyncio.run(main())

Tuning memory for hardware: The single most impactful configuration knob is the GPU block pool size. For soft-inject mode (Foundry Local, Ollama), this controls admission concurrency, not actual VRAM allocation by KVStream.
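To make "admission concurrency" concrete, the following sketch models a logical block pool that hands out blocks on demand. It assumes admission is decided purely by block counts; BlockPool, grow, release, and max_concurrent are hypothetical names for illustration and are not KVStream's API. The 192-token request size is an example, not a measured value.

```python
import math


class BlockPool:
    """Hypothetical logical block pool used only for admission accounting."""

    def __init__(self, num_gpu_blocks, block_size=16):
        self.block_size = block_size
        self.free = num_gpu_blocks
        self.tables = {}  # seq_id -> number of blocks currently held

    def blocks_for(self, num_tokens):
        """Blocks needed to hold num_tokens, rounding up to whole blocks."""
        return math.ceil(num_tokens / self.block_size)

    def grow(self, seq_id, total_tokens):
        """Give seq_id enough blocks for total_tokens; False if the pool is exhausted."""
        held = self.tables.get(seq_id, 0)
        need = max(self.blocks_for(total_tokens) - held, 0)
        if need > self.free:
            return False
        self.free -= need
        self.tables[seq_id] = held + need
        return True

    def release(self, seq_id):
        """Return all blocks of a finished sequence to the pool."""
        self.free += self.tables.pop(seq_id, 0)


def max_concurrent(num_gpu_blocks, tokens_per_seq, block_size=16):
    """Upper bound on simultaneous sequences of a given total length."""
    return num_gpu_blocks // math.ceil(tokens_per_seq / block_size)


# A 128-token prompt plus a 64-token answer needs 12 blocks of 16 tokens.
print(max_concurrent(128, 128 + 64))  # 10
print(max_concurrent(64, 128 + 64))   # 5

pool = BlockPool(num_gpu_blocks=128)
admitted = sum(pool.grow(seq_id, 192) for seq_id in range(12))
print(admitted, pool.free)  # 10 8
```

Under this model, doubling the pool roughly doubles the number of requests that can be in flight, which is why the pool size is the first knob to adjust.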
A practical starting table is as follows:

    Device VRAM / RAM    num_gpu_blocks    Suitable models (examples)
    4 GB                 64                phi-3-mini
    8 GB                 128               llama3-8b, mistral-7b
    16 GB                256               llama3-8b (q8)
    24 GB                512               llama3-70b (q4), mixtral

Via YAML configuration:

    # kvstream.yaml
    backend:
      type: foundry
      model: phi-3-mini
    memory:
      num_gpu_blocks: 128
      num_cpu_blocks: 256
      block_size: 16
    scheduler:
      max_batch_size: 8
      preemption_policy: swap
      priority: fcfs
    prefix_cache:
      enabled: true
      ttl_seconds: 3600
      min_match_tokens: 16

Benchmarking

KVStream ships with a built-in benchmarking command:

    kvstream bench \
      --url http://localhost:8080 \
      --model phi-3-mini \
      --concurrency 8 \
      --prompt-len 128 \
      --output-len 64 \
      --total-requests 100

Example output:

    ┌──────────────────────────┐
    │ KVStream Benchmark       │
    ├──────────────┬───────────┤
    │ Requests     │ 100       │
    │ Concurrency  │ 8         │
    │ Errors       │ 0         │
    │ Throughput   │ 12.4 req/s│
    │ p50          │ 612 ms    │
    │ p99          │ 1840 ms   │
    └──────────────┴───────────┘

KVStream covers the most common local inference runtimes today (Foundry Local, Ollama, llama.cpp, and LM Studio) through a single, consistent interface. If you are building on Foundry Local and running into the throughput or memory fragmentation issues described here, KVStream is designed to slot in with a single command and zero changes to your application code.

    pip install kvstream
    kvstream serve --backend foundry --model phi-3-mini

The full integration guide, configuration reference, and Docker deployment instructions are available in the KVStream Documentation on GitHub.

We are always looking to improve! If you want to help make KVStream even better, check out our Contributing Guide to get started on your first pull request.
shreyanfern
Jun 23, 2026 · Educator Developer Blog
102 Views
0 likes
1 Comment
How to Delete Microsoft Teams Cache for All Users via PowerShell
Improve Microsoft Teams performance by deleting the cache for all users. This article provides detailed steps on how to clear the cache, which can reduce clutter and improve application speed. By following these steps, you can ensure that your Microsoft Teams application is running smoothly and efficiently.
AnthonyBartolo
Jul 22, 2024 · Educator Developer Blog
41K Views
0 likes
9 Comments