A hands-on guide to building real-world AI automation with Foundry Local, the Microsoft Agent Framework, and PyBullet. No cloud subscription, no API keys, no internet required.

Why Developers Should Care About Offline AI
Imagine telling a robot arm to "pick up the cube" and watching it execute the command in a physics simulator, all powered by a language model running on your laptop. No API calls leave your machine. No token costs accumulate. No internet connection is needed.
That is what this project delivers, and every piece of it is open source and ready for you to fork, extend, and experiment with.
Most AI demos today lean on cloud endpoints. That works for prototypes, but it introduces latency, ongoing costs, and data privacy concerns. For robotics and industrial automation, those trade-offs are unacceptable. You need inference that runs where the hardware is: on the factory floor, in the lab, or on your development machine.
Foundry Local gives you an OpenAI-compatible endpoint running entirely on-device. Pair it with a multi-agent orchestration framework and a physics engine, and you have a complete pipeline that translates natural language into validated, safe robot actions.
This post walks through how we built it, why the architecture works, and how you can start experimenting with your own offline AI simulators today.
Architecture
The system uses four specialised agents orchestrated by the Microsoft Agent Framework:
| Agent | What It Does | Speed |
|---|---|---|
| PlannerAgent | Sends user command to Foundry Local LLM → JSON action plan | 4–45 s |
| SafetyAgent | Validates against workspace bounds + schema | < 1 ms |
| ExecutorAgent | Dispatches actions to PyBullet (IK, gripper) | < 2 s |
| NarratorAgent | Template summary (LLM opt-in via env var) | < 1 ms |
```
User (text / voice)
        │
        ▼
┌──────────────┐
│ Orchestrator │
└──────┬───────┘
       │
  ┌────┴────┐
  ▼         ▼
Planner   Narrator
  │
  ▼
Safety
  │
  ▼
Executor
  │
  ▼
PyBullet
```
Setting Up Foundry Local
```python
from foundry_local import FoundryLocalManager
import openai

# Start (or attach to) the local service and load the model.
manager = FoundryLocalManager("qwen2.5-coder-0.5b")

# Any OpenAI-compatible client works against the local endpoint.
client = openai.OpenAI(
    base_url=manager.endpoint,
    api_key=manager.api_key,
)

resp = client.chat.completions.create(
    model=manager.get_model_info("qwen2.5-coder-0.5b").id,
    messages=[{"role": "user", "content": "pick up the cube"}],
    max_tokens=128,
    stream=True,
)
```
The SDK auto-selects the best hardware backend (CUDA GPU → QNN NPU → CPU). No configuration needed.
How the LLM Drives the Simulator
Understanding the interaction between the language model and the physics simulator is central to the project. The two never communicate directly. Instead, a structured JSON contract forms the bridge between natural language and physical motion.
From Words to JSON
When a user says “pick up the cube”, the PlannerAgent sends the command to the Foundry Local LLM alongside a compact system prompt. The prompt lists every permitted tool and shows the expected JSON format. The LLM responds with a structured plan:
```json
{
  "type": "plan",
  "actions": [
    {"tool": "describe_scene", "args": {}},
    {"tool": "pick", "args": {"object": "cube_1"}}
  ]
}
```
The planner parses this response, validates it against the action schema, and retries once if the JSON is malformed. This constrained output format is what makes small models (0.5B parameters) viable: the response space is narrow enough that even a compact model can produce correct JSON reliably.
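That parse-validate-retry loop can be sketched as follows. This is a minimal sketch with hypothetical names (`ALLOWED_TOOLS`, `parse_plan`, `plan_with_retry`); the project's actual planner validates against the full tool schema.

```python
import json

# Minimal sketch of the planner's parse-validate-retry loop.
# ALLOWED_TOOLS and the plan shape follow the JSON example above;
# the project's real schema validation is richer.
ALLOWED_TOOLS = {"describe_scene", "pick", "place", "move_ee", "reset"}

def parse_plan(raw):
    """Return a validated plan dict, or None if the response is unusable."""
    try:
        plan = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if plan.get("type") != "plan" or not isinstance(plan.get("actions"), list):
        return None
    if any(a.get("tool") not in ALLOWED_TOOLS for a in plan["actions"]):
        return None
    return plan

def plan_with_retry(ask_llm, command, retries=1):
    """Ask the LLM for a plan; retry once if the JSON is malformed."""
    for _ in range(retries + 1):
        plan = parse_plan(ask_llm(command))
        if plan is not None:
            return plan
    return None
```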
From JSON to Motion
Once the SafetyAgent approves the plan, the ExecutorAgent maps each action to concrete PyBullet calls:
- `move_ee(target_xyz)`: The target position in Cartesian coordinates is passed to PyBullet's inverse kinematics solver, which computes the seven joint angles needed to place the end-effector at that position. The robot then interpolates smoothly from its current joint state to the target, stepping the physics simulation at each increment.
- `pick(object)`: This triggers a multi-step grasp sequence. The controller looks up the object's position in the scene, moves the end-effector above the object, descends to grasp height, closes the gripper fingers with a configurable force, and lifts. At every step, PyBullet resolves contact forces and friction so that the object behaves realistically.
- `place(target_xyz)`: The reverse of a pick. The robot carries the grasped object to the target coordinates and opens the gripper, allowing the physics engine to drop the object naturally.
- `describe_scene()`: Rather than moving the robot, this action queries the simulation state and returns the position, orientation, and name of every object on the table, along with the current end-effector pose.
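The joint-space interpolation inside a `move_ee` handler can be sketched in isolation. This is a hypothetical helper: the real executor would obtain `target` from PyBullet's `p.calculateInverseKinematics` and call `p.stepSimulation()` after applying each intermediate configuration.

```python
# Hypothetical sketch of the move_ee interpolation step. In the real
# executor, `target` comes from p.calculateInverseKinematics(...) and
# each yielded configuration is applied before p.stepSimulation().
def interpolate_joints(current, target, steps=50):
    """Yield joint configurations linearly interpolated from current to target."""
    for i in range(1, steps + 1):
        t = i / steps
        yield [c + (g - c) * t for c, g in zip(current, target)]
```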
The Abstraction Boundary
The critical design choice is that the LLM knows nothing about joint angles, inverse kinematics, or physics. It operates purely at the level of high-level tool calls (pick, move_ee). The ActionExecutor translates those tool calls into the low-level API that PyBullet provides. This separation means the LLM prompt stays simple, the safety layer can validate plans without understanding kinematics, and the executor can be swapped out without retraining or re-prompting the model.
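That boundary amounts to a dispatch table mapping tool names to handlers. A minimal sketch (the handler bodies here are stand-ins; the project's `ActionExecutor` does the real PyBullet work):

```python
# Sketch of the tool-dispatch boundary: validated tool calls in,
# domain-specific calls out. Handler bodies are placeholders.
class ActionExecutor:
    def __init__(self):
        self._handlers = {
            "pick": self._do_pick,
            "describe_scene": self._do_describe_scene,
        }

    def dispatch(self, action):
        """Route one validated action to its handler."""
        handler = self._handlers.get(action["tool"])
        if handler is None:
            raise ValueError(f"unknown tool: {action['tool']}")
        return handler(**action.get("args", {}))

    def _do_pick(self, object):
        return {"status": "ok", "picked": object}

    def _do_describe_scene(self):
        return {"objects": []}
```

Because the LLM only ever sees tool names and arguments, swapping these handlers for a different backend never requires touching the prompt.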
Voice Input Pipeline

Voice commands follow three stages:
- Browser capture: `MediaRecorder` captures audio, and the client resamples it to 16 kHz mono WAV.
- Server transcription: Foundry Local Whisper (ONNX, cached after first load) with automatic 30 s chunking.
- Command execution: the transcribed text goes through the same Planner → Safety → Executor pipeline.
The mic button (🎤) only appears when a Whisper model is cached or loaded. Whisper models are filtered out of the LLM dropdown.
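The 30 s chunking step is simple windowing over the resampled sample buffer. A sketch (hypothetical helper name; the project's server-side code may slice differently):

```python
# Split a 16 kHz mono sample buffer into Whisper-sized 30 s windows.
SAMPLE_RATE = 16_000
CHUNK_SECONDS = 30

def chunk_audio(samples):
    """Return consecutive chunks of at most 30 s each."""
    size = SAMPLE_RATE * CHUNK_SECONDS
    return [samples[i:i + size] for i in range(0, len(samples), size)]
```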
Web UI in Action
Performance: Model Choice Matters
| Model | Params | Inference | Pipeline Total |
|---|---|---|---|
| qwen2.5-coder-0.5b | 0.5 B | ~4 s | ~5 s |
| phi-4-mini | 3.6 B | ~35 s | ~36 s |
| qwen2.5-coder-7b | 7 B | ~45 s | ~46 s |
For interactive robot control, qwen2.5-coder-0.5b is the clear winner: valid JSON for a 7-tool schema in under 5 seconds.
The Simulator in Action
Here is the Panda robot arm performing a pick-and-place sequence in PyBullet. Each frame is rendered by the simulator's built-in camera and streamed to the web UI in real time.
Get Running in Five Minutes
You do not need a GPU, a cloud account, or any prior robotics experience. The entire stack runs on a standard development machine.
```shell
# 1. Install Foundry Local
winget install Microsoft.FoundryLocal   # Windows
brew install foundrylocal               # macOS

# 2. Download models (one-time, cached locally)
foundry model run qwen2.5-coder-0.5b    # Chat brain (~4 s inference)
foundry model run whisper-base          # Voice input (194 MB)

# 3. Clone and set up the project
git clone https://github.com/leestott/robot-simulator-foundrylocal
cd robot-simulator-foundrylocal
.\setup.ps1                             # or ./setup.sh on macOS/Linux

# 4. Launch the web UI
python -m src.app --web --no-gui        # → http://localhost:8080
```
Once the server starts, open your browser and try these commands in the chat box:
- "pick up the cube": the robot grasps the blue cube and lifts it
- "describe the scene": returns every object's name and position
- "move to 0.3 0.2 0.5": sends the end-effector to specific coordinates
- "reset": returns the arm to its neutral pose
If you have a microphone connected, hold the mic button and speak your command instead of typing. Voice input uses a local Whisper model, so your audio never leaves the machine.
Experiment and Build Your Own
The project is deliberately simple so that you can modify it quickly. Here are some ideas to get started.
Add a new robot action
The robot currently understands seven tools. Adding an eighth takes four steps:
1. Define the schema in `TOOL_SCHEMAS` (`src/brain/action_schema.py`).
2. Write a `_do_<tool>` handler in `src/executor/action_executor.py`.
3. Register it in `ActionExecutor._dispatch`.
4. Add a test in `tests/test_executor.py`.
For example, you could add a `rotate_ee` tool that spins the end-effector to a given roll/pitch/yaw without changing position.
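The core of such a handler is an orientation conversion. PyBullet's `p.getQuaternionFromEuler` does this for you; as a sketch of what happens underneath (pure Python, ZYX convention):

```python
import math

# Roll/pitch/yaw (ZYX convention) to quaternion, as a rotate_ee handler
# would need. In practice you would call p.getQuaternionFromEuler instead.
def euler_to_quaternion(roll, pitch, yaw):
    cr, sr = math.cos(roll / 2), math.sin(roll / 2)
    cp, sp = math.cos(pitch / 2), math.sin(pitch / 2)
    cy, sy = math.cos(yaw / 2), math.sin(yaw / 2)
    return (
        sr * cp * cy - cr * sp * sy,  # x
        cr * sp * cy + sr * cp * sy,  # y
        cr * cp * sy - sr * sp * cy,  # z
        cr * cp * cy + sr * sp * sy,  # w
    )
```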
Add a new agent
Every agent follows the same pattern: an async `run(context)` method that reads from and writes to a shared dictionary. Create a new file in `src/agents/`, register it in `orchestrator.py`, and the pipeline will call it in sequence.
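A skeleton for that pattern might look like this (hypothetical agent and context keys; the framework's real registration hooks live in `orchestrator.py`):

```python
import asyncio

# Hypothetical agent following the async run(context) pattern:
# read shared state, write a result back for downstream agents.
class ExplanationAgent:
    async def run(self, context):
        steps = [a["tool"] for a in context.get("plan", {}).get("actions", [])]
        context["explanation"] = "I will: " + ", then ".join(steps)

async def demo():
    context = {"plan": {"actions": [{"tool": "pick"}, {"tool": "place"}]}}
    await ExplanationAgent().run(context)
    return context
```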
Ideas for new agents:
- VisionAgent: analyse a camera frame to detect objects and update the scene state before planning.
- CostEstimatorAgent: predict how many simulation steps an action plan will take and warn the user if it is expensive.
- ExplanationAgent: generate a step-by-step natural language walkthrough of the plan before execution, allowing the user to approve or reject it.
Swap the LLM
```shell
python -m src.app --web --model phi-4-mini
```
Or use the model dropdown in the web UI; no restart is needed. Try different models and compare accuracy against inference speed. Smaller models are faster but may produce malformed JSON more often. Larger models are more accurate but slower. The retry logic in the planner compensates for occasional failures, so even a small model works well in practice.
Swap the simulator
PyBullet is one option, but the architecture does not depend on it. You could replace the simulation layer with:
- MuJoCo: a high-fidelity physics engine popular in reinforcement learning research.
- Isaac Sim: NVIDIA's GPU-accelerated robotics simulator with photorealistic rendering.
- Gazebo: the standard ROS simulator, useful if you plan to move to real hardware through ROS 2.
The only requirement is that your replacement implements the same interface as PandaRobot and GraspController.
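That interface can be pinned down as a structural type. A hypothetical sketch (the real `PandaRobot` and `GraspController` signatures may differ):

```python
from typing import Protocol, Sequence, Tuple, runtime_checkable

# Hypothetical interface a replacement simulator backend would satisfy.
# Any class with these methods is accepted, regardless of engine.
@runtime_checkable
class RobotBackend(Protocol):
    def move_ee(self, target_xyz: Tuple[float, float, float]) -> None: ...
    def pick(self, object_name: str) -> bool: ...
    def place(self, target_xyz: Tuple[float, float, float]) -> bool: ...
    def describe_scene(self) -> Sequence[dict]: ...
```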
Build something completely different
The pattern at the heart of this project (LLM produces structured JSON, safety layer validates, executor dispatches to a domain-specific engine) is not limited to robotics. You could apply the same architecture to:
- Home automation: "turn off the kitchen lights and set the thermostat to 19 degrees" translated into MQTT or Zigbee commands.
- Game AI: natural language control of characters in a game engine, with the safety agent preventing invalid moves.
- CAD automation: voice-driven 3D modelling where the LLM generates geometry commands for OpenSCAD or FreeCAD.
- Lab instrumentation: controlling scientific equipment (pumps, stages, spectrometers) via natural language, with the safety agent enforcing hardware limits.
From Simulator to Real Robot
One of the most common questions about projects like this is whether it could control a real robot. The answer is yes, and the architecture is designed to make that transition straightforward.
What Stays the Same
The entire upper half of the pipeline is hardware-agnostic:
- The LLM planner generates the same JSON action plans regardless of whether the target is simulated or physical. It has no knowledge of the underlying hardware.
- The safety agent validates workspace bounds and tool schemas. For a real robot, you would tighten the bounds to match the physical workspace and add checks for obstacle clearance using sensor data.
- The orchestrator coordinates agents in the same sequence. No changes are needed.
- The narrator reports what happened. It works with any result data the executor returns.
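The safety agent's bounds check, for instance, is the same code in simulation and on hardware; only the numbers change. A sketch with hypothetical bounds (the project's real workspace limits and schema live elsewhere):

```python
# Hypothetical workspace bounds, in metres; a real robot would use the
# measured physical envelope instead.
WORKSPACE = {"x": (-0.5, 0.5), "y": (-0.5, 0.5), "z": (0.0, 0.8)}

def within_bounds(xyz):
    """True if the target point lies inside the workspace box."""
    return all(lo <= v <= hi for v, (lo, hi) in zip(xyz, WORKSPACE.values()))

def validate_plan(plan):
    """Reject any plan whose move_ee target leaves the workspace."""
    for action in plan.get("actions", []):
        if action["tool"] == "move_ee" and not within_bounds(action["args"]["target_xyz"]):
            return False
    return True
```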
What Changes
The only component that must be replaced is the executor layer, specifically the PandaRobot class and the GraspController. In simulation, these call PyBullet's inverse kinematics solver and step the physics engine. On a real robot, they would instead call the hardware driver.
For a Franka Emika Panda (the same robot modelled in the simulation), the replacement options include:
- libfranka: Franka's C++ real-time control library, which accepts joint position or torque commands at 1 kHz.
- ROS 2 with MoveIt: a robotics middleware stack that provides motion planning, collision avoidance, and hardware abstraction. The `move_ee` action would become a MoveIt goal, and the framework would handle trajectory planning and execution.
- Franka ROS 2 driver: combines libfranka with ROS 2 for a drop-in replacement of the simulation controller.
The `ActionExecutor._dispatch` method maps tool names to handler functions. Replacing `_do_move_ee`, `_do_pick`, and `_do_place` with calls to a real robot driver is the only code change required.
Key Considerations for Real Hardware
- Safety: A simulated robot cannot cause physical harm; a real robot can. The safety agent would need to incorporate real-time collision checking against sensor data (point clouds from depth cameras, for example) rather than relying solely on static workspace bounds.
- Perception: In simulation, object positions are known exactly. On a real robot, you would need a perception system (cameras with object detection or fiducial markers) to locate objects before grasping.
- Calibration: The simulated robot's coordinate frame matches the URDF model perfectly. A real robot requires hand-eye calibration to align camera coordinates with the robot's base frame.
- Latency: Real actuators have physical response times. The executor would need to wait for motion completion signals from the hardware rather than stepping a simulation loop.
- Gripper feedback: In PyBullet, grasp success is determined by contact forces. A real gripper would provide force or torque feedback to confirm whether an object has been securely grasped.
The Simulation as a Development Tool
This is precisely why simulation-first development is valuable. You can iterate on the LLM prompts, agent logic, and command pipeline without risk to hardware. Once the pipeline reliably produces correct action plans in simulation, moving to a real robot is a matter of swapping the lowest layer of the stack.
Key Takeaways for Developers
- On-device AI is production-ready. Foundry Local serves models through a standard OpenAI-compatible API. If your code already uses the OpenAI SDK, switching to local inference is a one-line change to `base_url`.
- Small models are surprisingly capable. A 0.5B-parameter model produces valid JSON action plans in under 5 seconds. For constrained output schemas, you do not need a 70B model.
- Multi-agent pipelines are more reliable than monolithic prompts. Splitting planning, validation, execution, and narration across four agents makes each one simpler to test, debug, and replace.
- Simulation is the safest way to iterate. You can refine LLM prompts, agent logic, and tool schemas without risking real hardware. When the pipeline is reliable, swapping the executor for a real robot driver is the only change needed.
- The pattern generalises beyond robotics. Structured JSON output from an LLM, validated by a safety layer, dispatched to a domain-specific engine: that pattern works for home automation, game AI, CAD, lab equipment, and any other domain where you need safe, structured control.
- You can start building today. The entire project runs on a standard laptop with no GPU, no cloud account, and no API keys. Clone the repository, run the setup script, and you will have a working voice-controlled robot simulator in under five minutes.