Microsoft Developer Community Blog

oBeaver — A Beaver That Runs LLMs on Your Machine 🦫

kinfey
Apr 03, 2026


Hi there! I'm the creator of oBeaver.

This project started from a pretty simple desire: I wanted to run large language models on my own computer. No data sent to the cloud. No API keys. No per-call charges. I'm guessing you've had the same thought.

There are already great tools out there — Ollama being the most prominent. But in my day-to-day work, I spend a lot of time in the ONNX ecosystem — the cross-platform reach of ONNX Runtime, its native NPU support, the turnkey experience of Microsoft Foundry Local. It kept nagging at me: the ONNX ecosystem deserves a more complete local inference toolkit. That's how oBeaver was born.

Here's the repo if you want to jump straight in: https://github.com/microsoft/obeaver

Up and Running in Three Minutes

Getting started with oBeaver is dead simple. You need Python 3.12+, then it's clone, install, chat:

git clone https://github.com/microsoft/obeaver.git
cd obeaver
pip install -e .

# Initialize the model directory (auto-creates ort/, foundrylocal/, cache_dir/ sub-folders)
obeaver init

# Make sure everything looks good
obeaver check

If you're on macOS or Windows, install Foundry Local and you're one command away from chatting with a model:

obeaver run phi-4-mini

The first run downloads the model automatically — give it a minute. After that, it's instant.

On Linux, or if you want to use models from Hugging Face, the ORT engine has you covered:

# Convert Qwen3-0.6B from Hugging Face to ONNX format
obeaver convert Qwen/Qwen3-0.6B

# Run it with the ORT engine
obeaver run --engine ort ./models/ort/Qwen3-0.6B_ONNX_INT4_CPU

Want to turn your model into an HTTP service? One line:

obeaver serve Phi-4-mini

Then point any OpenAI-compatible client at it — just change one base_url and your existing code works as-is:

from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:18000/v1", api_key="unused")

response = client.chat.completions.create(
    model="Phi-4-mini",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    stream=True,
)
for chunk in response:
    print(chunk.choices[0].delta.content or "", end="", flush=True)

LangChain, LlamaIndex, Microsoft Agent Framework, CrewAI — anything that speaks the OpenAI protocol plugs right in. This was a non-negotiable design principle from day one: local inference shouldn't be an island; it should fit seamlessly into your existing dev workflow.

"Why Not Just Use Ollama?"

I get this question a lot, and it deserves a straight answer.

Ollama is a fantastic project. It pioneered the "one command to run a model" experience and made local LLM inference accessible to everyone. If all you need is a quick way to chat with a model locally, Ollama is still a wonderful choice. oBeaver itself draws heavy inspiration from it.

But Ollama and oBeaver take different technical paths. Ollama is built on llama.cpp and uses the GGUF model format. oBeaver is built on ONNX Runtime and uses the ONNX model format. Behind these two formats are two very different philosophies.

GGUF: Grab and Go

GGUF's strength is ultimate portability. One file bundles everything — weights, tokenizer, metadata. Hugging Face is packed with pre-quantized GGUF models ready to download and run. Quantization options are rich (Q2_K through Q8_0), and the community is incredibly active. For individual developers, this "grab and go" experience is hard to beat.

ONNX: Convert Once, Accelerate Everywhere

ONNX shines in a different dimension. As a cross-platform industrial standard, ONNX Runtime has something called Execution Providers — the same ONNX model, without any changes, can run on CPU, GPU, and even NPU.

This matters more than it might seem at first glance. With chips like Intel Core Ultra, Qualcomm Snapdragon X, and Apple Neural Engine becoming mainstream, NPUs are quickly becoming standard hardware in AI PCs. ONNX Runtime already supports NPU acceleration natively, while the GGUF ecosystem doesn't have this capability yet. This means ONNX naturally adapts to a far wider range of devices — from servers to laptops, from desktops to edge devices, even phones and IoT endpoints. The ONNX model you run on CPU today can be accelerated on an NPU-equipped machine tomorrow — no re-conversion, no code changes, just switch the Execution Provider.
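
To make the "just switch the Execution Provider" point concrete, here's a minimal sketch of priority-ordered EP selection. The provider names are real ONNX Runtime identifiers; the helper itself is illustrative, not oBeaver's actual code:

```python
# Illustrative sketch, not oBeaver's source: pick the best Execution
# Provider the current machine actually offers. The names are real
# ONNX Runtime EP identifiers.
PREFERRED_EPS = [
    "QNNExecutionProvider",   # Qualcomm NPU
    "CUDAExecutionProvider",  # NVIDIA GPU
    "CPUExecutionProvider",   # universal fallback
]

def pick_provider(available: list[str]) -> str:
    """Return the highest-priority provider present in `available`."""
    for ep in PREFERRED_EPS:
        if ep in available:
            return ep
    return "CPUExecutionProvider"

# The model file never changes; only this choice differs per machine.
print(pick_provider(["CPUExecutionProvider"]))  # CPUExecutionProvider
```

With onnxruntime installed, you'd pass the chosen name to `InferenceSession(model_path, providers=[...])` — everything else stays identical across hardware.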

ONNX does have a higher barrier to entry — models need to be converted first. But oBeaver's built-in obeaver convert command, powered by Microsoft's Olive toolkit, reduces that to a single line.

Another project worth mentioning is oMLX, which also explores local inference in the ONNX ecosystem, but focuses specifically on Apple Silicon. oBeaver aims to be more comprehensive — spanning macOS, Windows, and Linux, covering text chat, embeddings, and vision-language scenarios.

Here's a quick comparison of all three:

|  | Ollama | oMLX | oBeaver |
| --- | --- | --- | --- |
| Model format | GGUF | ONNX | ONNX |
| Inference backend | llama.cpp | ONNX Runtime | Foundry Local + ORT GenAI |
| Platforms | macOS / Linux / Windows | macOS | macOS / Windows / Linux |
| NPU acceleration | ✗ | — | ✓ (on the roadmap) |
| Embedding models | ✓ | — | ✓ |
| VL models | ✓ | — | ✓ |
| Function Calling | ✓ | — | ✓ |
| Docker deployment | ✓ | — | ✓ |

I'm not saying oBeaver is better than Ollama. They serve different needs. But if your work involves the ONNX ecosystem, NPU acceleration, or a combination of embedding and multimodal capabilities, oBeaver offers a path that Ollama doesn't currently cover.

Why a "Dual Engine"?

This is oBeaver's most distinctive design decision, and the one I spent the most time thinking about.

oBeaver has two inference engines under the hood: Foundry Local and ONNX Runtime GenAI (ORT). Why not just pick one? Because reality is messier than ideals.

Foundry Local is Microsoft's local inference runtime, and the experience is lovely — pass a catalog alias like Phi-4-mini, and it auto-downloads, loads, and runs the model with smart hardware scheduling (NPU > GPU > CPU). But it has two clear limitations: first, the model catalog is still small, mostly centered around Microsoft's Phi family; second, it only supports macOS and Windows — Linux users are left out.

ONNX Runtime GenAI fills exactly those gaps. It supports macOS, Windows, and Linux — all three platforms. And with obeaver convert, you can transform almost any model on Hugging Face into ONNX format, giving you a much wider model selection. Right now, oBeaver can already run models from Phi, Qwen, Gemma, GLM, and other SLM families through the ORT engine. On top of that, the ORT engine powers capabilities that Foundry Local simply can't do:

Embedding models — The ORT engine includes a dedicated embedding engine supporting Qwen3-Embedding and EmbeddingGemma, perfect for local RAG pipelines:

# Start the embedding service
obeaver serve-embed ./models/Qwen3-Embedding-0.6B

Then call it from any OpenAI-compatible client:

from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:18001/v1", api_key="unused")

response = client.embeddings.create(
    model="Qwen3-Embedding-0.6B",
    input=["Hello, world!", "Embeddings are useful."],
)
for item in response.data:
    print(f"index={item.index}  dim={len(item.embedding)}")
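
For the RAG use case, the vectors that come back are typically ranked by cosine similarity. A stdlib-only sketch (the helper name is mine, not part of oBeaver):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors (-1.0 to 1.0)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Identical directions score 1.0; orthogonal vectors score 0.0.
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```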

Vision-Language models (VL) — When the ORT engine detects vision.onnx in a model directory, it automatically switches to VL mode. Currently supported: Qwen2.5-VL-3B and Qwen3-VL-2B. You can send images alongside text for multimodal understanding:

obeaver serve ./models/Qwen3-VL-2B-Instruct_VL_ONNX_INT4_CPU

Converting a VL model is just one command too:

obeaver convert Qwen/Qwen2.5-VL-3B-Instruct --type vl
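
On the client side, VL chat uses the standard OpenAI "content parts" message shape. A sketch of building such a message — the helper and the example URL are my own illustrations, assuming the server accepts the standard format:

```python
# Build an OpenAI-style multimodal chat message. The helper and the
# example URL are illustrative; the content-parts shape is the
# standard OpenAI format the VL endpoint is assumed to accept.
def vl_message(text: str, image_url: str) -> dict:
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": text},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }

msg = vl_message("Describe this image.", "https://example.com/photo.jpg")
print([part["type"] for part in msg["content"]])  # ['text', 'image_url']
```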

So the dual engine isn't redundancy — it's the optimal choice given reality: Foundry Local covers only macOS/Windows; ORT GenAI covers all platforms. Foundry Local has fewer models but zero friction; ORT GenAI has more models and more flexibility. oBeaver automatically picks the right engine for your platform and task — Foundry Local by default on macOS/Windows, ORT on Linux, auto-switching to ORT for embedding or VL workloads. You can always override with --engine ort.

In short: Foundry Local handles the "just works" path, ORT handles the "I need more" path. Together, they give oBeaver an answer for every platform and every scenario.

Cloud-Native? Of Course

oBeaver isn't just a local toy. Deployment was baked into the design from the start.

The architecture is cleanly layered: CLI (Typer) → FastAPI Server → pluggable inference engines. We ship a Docker image supporting both linux/amd64 and linux/arm64 (Apple Silicon included):

# Build the image
docker buildx build --platform=linux/amd64 \
  -f docker/Dockerfile.cpu -t obeaver-cpu .

# Start the API server
docker run -d --rm -p 18000:18000 \
  -v /path/to/models:/models \
  obeaver-cpu serve -m /models -E ort --host 0.0.0.0 --port 18000

Local dev, CI/CD pipelines, headless servers, Kubernetes clusters — it all works. Combined with the OpenAI-compatible API, you can develop against oBeaver locally and switch to a cloud endpoint in production by changing a single URL. Not a single line of application code needs to change.
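
One way to wire up that single-URL switch is an environment variable. The variable name OBEAVER_BASE_URL below is my own convention, not something oBeaver defines:

```python
import os

# Hypothetical convention: OBEAVER_BASE_URL is not an official oBeaver
# variable, just one way to flip between local dev and a cloud endpoint
# without touching application code.
def resolve_base_url(default: str = "http://127.0.0.1:18000/v1") -> str:
    return os.environ.get("OBEAVER_BASE_URL", default)

# Pass the result as base_url when constructing your OpenAI client:
# unset locally, set to the cloud endpoint in production.
print(resolve_base_url())
```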

Not Just a CLI — There's a Dashboard Too

So far everything I've shown has been terminal commands. But sometimes you just want a visual interface — especially when you're evaluating models, comparing performance, or showing a demo.

oBeaver ships with a built-in web dashboard. One command to launch:

obeaver dashboard         # Foundry Local engine (macOS/Windows)
obeaver dashboard -e ort  # ORT engine (scans local ONNX models)

Open http://127.0.0.1:1573/ in your browser.

It's a real-time monitoring and chat interface rolled into one. Here's what you get:

Model Selector — Switch between your cached models on the fly. If a model supports NPU acceleration, it's marked with a ⚡ badge. With Foundry Local, you'll see the models from your local catalog.

With the ORT engine, it scans your model directory for all available ONNX models.

Chat + Live Benchmarking — Send messages and get streaming responses, with real-time performance stats right in the interface — TTFT (Time to First Token), tokens per second, total token count. This makes it incredibly easy to benchmark different models side by side.
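
Those two headline numbers are simple to derive from timestamps. A sketch of the arithmetic (illustrative, not the dashboard's actual code):

```python
# TTFT and decode throughput from three timestamps -- illustrative
# arithmetic, not the dashboard's actual implementation.
def benchmark_stats(t_start: float, t_first_token: float,
                    t_end: float, n_tokens: int) -> tuple[float, float]:
    """Return (TTFT in seconds, decode-phase tokens per second)."""
    ttft = t_first_token - t_start
    decode_time = t_end - t_first_token
    # The first token belongs to the prefill phase, so exclude it.
    tps = (n_tokens - 1) / decode_time if decode_time > 0 else 0.0
    return ttft, tps

ttft, tps = benchmark_stats(0.0, 0.25, 2.25, 41)
print(f"TTFT={ttft:.2f}s, {tps:.1f} tok/s")  # TTFT=0.25s, 20.0 tok/s
```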

System Monitoring — Real-time memory gauges for CPU, GPU, NPU, and process memory. A system info bar shows the current model, engine type, platform, and health status at a glance.

Inference Parameters — Adjust temperature, top-p, top-k, and max tokens with built-in presets, all without restarting the server.

VL Mode — When you load a Vision-Language model in the ORT dashboard, the interface automatically switches to a dedicated VL mode where you can provide an image URL alongside your text prompt.

And more — Conversation history with save/load, system prompt configuration, live server logs showing every request with method/path/status/timing, and export to JSON or Markdown.

The dashboard isn't a separate product — it's just obeaver dashboard. Everything runs locally, nothing phones home. It's particularly useful when you want to quickly evaluate how a model performs on your hardware before committing to it in your application.

Being Honest: CPU Only for Now

oBeaver is currently in Tech Preview, and I want to be upfront about this — it only supports CPU inference right now.

This is a deliberate, stage-by-stage choice. We wanted to make sure the entire toolchain — model conversion, inference, API serving, Docker deployment — is rock solid on CPU first. Almost every machine has a CPU; it's the best baseline for validating the complete workflow.

But GPU and NPU support are coming soon. They're at the very top of the roadmap. ONNX Runtime already ships mature CUDA (GPU) and QNN/OpenVINO (NPU) Execution Providers. Foundry Local already has NPU > GPU > CPU auto-scheduling built in. What oBeaver needs to do is integrate these into its engine selection logic and model conversion pipeline — and that work is actively underway.

Ultimately, one of the key reasons oBeaver chose the ONNX path is the NPU future. The AI PC era is arriving, and when NPUs become standard hardware, ONNX will be the ecosystem most ready for it.

Acknowledgements

oBeaver is inspired by and builds upon the ideas from the following excellent projects:

| Project | Description |
| --- | --- |
| Ollama | Run large language models locally with a simple CLI |
| oMLX | Run large language models on Apple Silicon, ONNX-based |
| vLLM | High-throughput and memory-efficient inference engine for LLMs |
| Foundry Local | Microsoft's local model inference runtime with NPU/GPU/CPU acceleration |
| ONNX Runtime GenAI | Generative AI extensions for ONNX Runtime |
| Olive | Microsoft's model optimization toolkit for ONNX Runtime |

I Need Your Feedback

That's the tour. But oBeaver is still in its early days, and there's so much room to improve.

As the creator of this project, what I fear most isn't criticism — it's silence. So I genuinely hope you'll give it a try and let me know what you think:

  • Which models do you most want to run?
  • How urgent is GPU / NPU acceleration for your use case?
  • What do you think of the dual-engine design — does it add value, or does it add complexity?
  • In your real-world projects, what's the biggest pain point with local inference?
  • What else does the Docker story need? Helm Charts? Compose files?

GitHub Issues, PRs, or just reaching out on social media — any form of feedback is deeply appreciated.

The name oBeaver comes from the beaver — nature's most remarkable engineer. Beavers build dams stick by stick, creating the environment they need to thrive. I hope oBeaver can help you do the same: build your local AI infrastructure, one piece at a time, on your own hardware.

Build local. Dam the cloud. 🦫

If you find oBeaver useful, a ⭐ on GitHub means the world to us!

Updated Apr 01, 2026
Version 1.0