What is Foundry Local?
Foundry Local is a user-friendly tool that helps you run small AI language models directly on your Windows or Mac computer. Think of it as a way to have your own personal ChatGPT running locally on your machine, without needing an internet connection or sending your data to external servers.
Currently, Foundry Local works great with several popular model families:
- Phi models (Microsoft's small but powerful models)
- Qwen models (Alibaba's multilingual models)
- DeepSeek models (efficient reasoning models)
In this tutorial, we'll learn how to set up the Qwen3-0.6B model step by step. Don't worry if you're new to AI - we'll explain everything along the way!
Why Do We Need to Convert AI Models?
When you download AI models from websites like Hugging Face (think of it as GitHub for AI models), they usually come as PyTorch weights. While PyTorch is great for training models, it isn't the most efficient format for running them on your personal computer.
To make these models work efficiently on your laptop or desktop, we need to:
- Convert the format - Change it to something your computer can run faster
- Make it smaller - Compress the model so it uses less memory and storage
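To get a feel for why compression matters, here's a rough back-of-the-envelope size estimate for a 0.6B-parameter model at different precisions. This is illustrative arithmetic only - real model files also include metadata and some tensors kept at higher precision:
# Approximate size of the weights alone for a 0.6B-parameter model
params = 0.6e9
print(f"FP16 (2 bytes/param):   ~{params * 2 / 1e9:.1f} GB")    # roughly 1.2 GB
print(f"INT8 (1 byte/param):    ~{params * 1 / 1e9:.1f} GB")    # roughly 0.6 GB
print(f"INT4 (0.5 bytes/param): ~{params * 0.5 / 1e9:.1f} GB")  # roughly 0.3 GB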
We'll use two main formats for this conversion:
GGUF vs ONNX: Which Format Should You Choose?
Think of these as different "languages" that your computer can understand. Since we're working with small language models (like Qwen3-0.6B), let's see which format works best for your needs:
GGUF (GPT-Generated Unified Format)
Best for: Basic computers, simple setups, or if you want maximum simplicity
Advantages:
- Super memory-efficient - Uses much less RAM through smart compression
- One file, that's it - Everything you need is in a single file (no complicated folders)
- CPU-friendly - Works great on older computers or laptops without powerful graphics cards
- Simple tools - Easy to use with popular tools like llama.cpp
- Quick to set up - Less configuration needed
Disadvantages:
- Size matters - For small models, the format adds relatively more "overhead"
- Limited flexibility - Only works with certain types of AI models (transformer-based)
ONNX (Open Neural Network Exchange)
Best for: Modern computers, when you want the best performance, or professional use
Advantages:
- Works with everything - Compatible with many different AI model types and architectures
- Hardware acceleration - Can use your graphics card (GPU) or special AI chips (NPU) for much faster performance
- Professional-grade - Used by companies in production environments
- Flexible conversion - Easy to convert from almost any AI training framework
- Mobile-ready - Great support for running on phones and tablets
- Smart optimization - ONNX Runtime automatically makes your model run faster
Disadvantages:
- More complex - Multiple files and folders to manage
- Larger file sizes - Takes up more storage space
- More setup - Requires a bit more configuration to get running
Our Recommendation for Beginners
For this tutorial, we'll use ONNX because:
- It gives you the best performance on most modern computers
- You can upgrade to GPU acceleration later if you want
- It's the industry standard that you'll encounter in most AI projects
- Foundry Local works excellently with ONNX models
Meet Microsoft Olive: Your Model Conversion Helper
Microsoft Olive is like a smart assistant that helps convert AI models for you. Instead of doing all the technical work manually, Olive automates the process and makes sure everything works correctly.
Here's what makes Olive special:
- Works with any computer setup - Whether you have a basic laptop or a gaming PC with a powerful graphics card
- Does the work for you - No need to learn complex conversion commands
- Multiple compression options - Can make your model smaller in different ways (INT4, INT8, FP16 - don't worry about these terms for now!)
- Plays well with others - Works seamlessly with other AI tools you might use
Let's Convert Your Model: Step-by-Step Guide
Don't worry if you've never done this before - we'll go through each step carefully!
Step 1: Install the Tools We Need
First, we need to install some software tools. Think of this like downloading apps on your phone - each tool has a specific job to help us convert the model.
Open your terminal (Command Prompt on Windows, Terminal on Mac) and run these commands one by one:
# This updates the main AI library to the latest version
pip install transformers -U
# This installs Microsoft Olive (our conversion helper)
pip install git+https://github.com/microsoft/Olive.git
# This downloads, builds, and installs ONNX Runtime GenAI (used to build and run the converted model)
git clone https://github.com/microsoft/onnxruntime-genai
cd onnxruntime-genai && python build.py --config Release
pip install {Your build release path}/onnxruntime_genai-0.9.0.dev0-cp311-cp311-linux_x86_64.whl
Important note: You'll also need CMake version 3.31 or newer. If you don't have it, you can download it from cmake.org.
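Once the installs finish, a quick way to confirm everything is importable is to run a few lines of Python. The module names below are the standard ones; if your wheel was built differently, adjust accordingly:
# Quick sanity check that the key packages can be imported
import transformers
import olive
import onnxruntime_genai as og

print("transformers", transformers.__version__)
print("onnxruntime-genai imported OK")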
Step 2: The Easy Way - One Command Conversion
Once everything is installed, converting your model is surprisingly simple! Just run this command (but replace {Your Qwen3-0.6B Path} with the actual location where you downloaded the model):
olive auto-opt \
--model_name_or_path {Your Qwen3-0.6B Path} \
--device cpu \
--provider CPUExecutionProvider \
--use_model_builder \
--precision int4 \
--output_path models/Qwen3-0.6B/onnx \
--log_level 1
What does this command do?
- --device cpu means we're optimizing for your computer's processor
- --precision int4 makes the model smaller (about 75% size reduction!)
- --output_path tells Olive where to save the converted model
Tip: If you have a powerful graphics card, you can change cpu to cuda for potentially better performance.
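If you want to sanity-check the converted model before moving on, you can load it with the onnxruntime-genai Python package. This is a minimal sketch - the exact folder name inside your output_path and the generation API can differ slightly between onnxruntime-genai versions, so treat it as a starting point rather than the definitive way:
# Minimal generation loop with onnxruntime-genai (API style for recent versions)
import onnxruntime_genai as og

model = og.Model("models/Qwen3-0.6B/onnx/model")   # assumed folder containing genai_config.json
tokenizer = og.Tokenizer(model)

params = og.GeneratorParams(model)
params.set_search_options(max_length=128)

generator = og.Generator(model, params)
generator.append_tokens(tokenizer.encode("<|im_start|>user\nWhat is ONNX?<|im_end|>\n<|im_start|>assistant\n"))

while not generator.is_done():
    generator.generate_next_token()

print(tokenizer.decode(generator.get_sequence(0)))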
Step 3: The Advanced Way - Using a Configuration File
For those who want more control, you can create a configuration file. This is like creating a recipe that tells Olive exactly how you want your model converted.
Create a new file called conversion_config.json and add this content:
{
"input_model": {
"type": "HfModel",
"model_path": "Qwen/Qwen3-0.6B",
"task": "text-generation"
},
"systems": {
"local_system": {
"type": "LocalSystem",
"accelerators": [
{
"execution_providers": [
"CPUExecutionProvider"
]
}
]
}
},
"passes": {
"builder": {
"type": "ModelBuilder",
"config": {
"precision": "int4"
}
}
},
"host": "local_system",
"target": "local_system",
"cache_dir": "cache",
"output_dir": "model/output/Qwen3-0.6B-ONNX"
}
Then run this command:
olive run --config ./conversion_config.json
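If you'd rather stay in Python, Olive also exposes a programmatic entry point that takes the same configuration file. A small sketch, assuming the olive.workflows.run helper that the Olive project documents:
# Run the same conversion workflow from Python instead of the CLI
from olive.workflows import run as olive_run

olive_run("./conversion_config.json")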
Before you start: If this is your first time downloading models from Hugging Face, you'll need to log in first:
huggingface-cli login
This will ask for your Hugging Face token (you can get one free from their website).
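If you prefer, the same login can be done from Python with the huggingface_hub library. The token string below is just a placeholder - use your own token from your account settings:
# Log in to Hugging Face programmatically
from huggingface_hub import login

login(token="hf_your_token_here")  # placeholder - replace with your own token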
Setting Up Your Converted Model in Foundry Local
Great! Now that you have a converted model, let's get it running in Foundry Local. This is like installing a new app on your computer.
What You'll Need
- Foundry Local installed on your computer
- Your freshly converted ONNX model from the previous steps
- A few minutes to set everything up
Getting Started
First, let's navigate to where Foundry Local stores its models:
foundry cache cd ./models/
This command tells Foundry Local which folder to use as its model cache - think of it as the app store for your AI models.
Step 1: Create a Chat Template
AI models need to know how to format conversations. It's like teaching them the "grammar" of chatting. Create a new file called inference_model.json with this content:
{
"Name": "Qwen3-0.6b-cpu",
"PromptTemplate": {
"system": "<|im_start|>system\n{Content}<|im_end|>",
"user": "<|im_start|>user\n/think{Content}<|im_end|>",
"assistant": "<|im_start|>assistant\n{Content}<|im_end|>",
"prompt": "<|im_start|>user\n/think{Content}<|im_end|>\n<|im_start|>assistant"
}
}
What's this "think" thing? Qwen models have a special feature where they can "think out loud" before giving you an answer. It's like showing their work in math class! This often leads to better, more thoughtful responses. If you don't want this feature, just remove /think from the templates above.
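To make the template concrete, here's a tiny illustration of the substitution Foundry Local performs for you - {Content} is simply replaced with whatever you type:
# Illustration only: how the "prompt" template becomes the raw string the model sees
template = "<|im_start|>user\n/think{Content}<|im_end|>\n<|im_start|>assistant"
print(template.replace("{Content}", "What is the capital of France?"))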
Step 2: Organize Your Files
Create a neat folder structure for your model. This helps Foundry Local find everything easily:
# Create a folder for your model
mkdir -p ./models/qwen/Qwen3-0.6B
# Copy your converted files here
# (You'll need to move your ONNX files and the inference_model.json to this folder)
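If you'd rather script the copy step, here's a small sketch. The source path is an assumption based on the output_path used earlier, so adjust it to wherever Olive actually wrote your files:
# Copy the converted ONNX files and the chat template into the Foundry cache layout
import shutil
from pathlib import Path

src = Path("models/Qwen3-0.6B/onnx/model")   # assumed Olive output folder from Step 2
dst = Path("./models/qwen/Qwen3-0.6B")       # Foundry Local model folder created above
dst.mkdir(parents=True, exist_ok=True)

for f in src.iterdir():
    shutil.copy2(f, dst / f.name)
shutil.copy2("inference_model.json", dst / "inference_model.json")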
Why this structure?
- qwen = the publisher namespace (it matches the organization name on Hugging Face)
- Qwen3-0.6B = the specific model name
Step 3: Check If Everything Worked
Let's verify that Foundry Local can see your new model:
foundry cache ls
You should see Qwen3-0.6b-cpu in the list. If you don't see it, double-check that your files are in the right place.
Step 4: Take It for a Test Drive!
The moment of truth - let's start chatting with your model:
foundry model run Qwen3-0.6b-cpu
If everything worked correctly, you should see your model starting up, and you can begin asking it questions!
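Beyond the interactive chat, Foundry Local also exposes an OpenAI-compatible endpoint, so you can call your model from code. The base URL below is an assumption - check the address and port that foundry service status reports on your machine:
# Chat with the local model through Foundry Local's OpenAI-compatible API
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:5273/v1",  # assumed local endpoint; check `foundry service status`
    api_key="not-needed-for-local",       # Foundry Local doesn't require a real key
)

response = client.chat.completions.create(
    model="Qwen3-0.6b-cpu",
    messages=[{"role": "user", "content": "Explain ONNX in one sentence."}],
)
print(response.choices[0].message.content)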
Troubleshooting: When Things Don't Go as Planned
Don't worry if you run into issues - this is totally normal! Here are the most common problems and how to fix them:
Problem: "Model not found" error
What happened: Foundry Local can't find your model files.
How to fix it:
- Double-check that your files are in the right folder: ./models/qwen/Qwen3-0.6B/
- Make sure the inference_model.json file is in the same folder as your ONNX files
- Check that the model name in the JSON file matches what you're trying to run
Problem: Model starts but gives weird responses
What happened: The chat template might not be set up correctly.
How to fix it:
- Check your inference_model.json file for typos
- Make sure the special characters (<|im_start|>, <|im_end|>) are exactly as shown
- Try removing the /think part if you're getting strange outputs
Problem: Model runs very slowly
What happened: Your computer might be working harder than it needs to.
How to fix it:
- Close other programs to free up memory
- If you have a good graphics card, try the GPU version instead of CPU
- Consider using a smaller model if performance is still poor
Problem: Installation commands fail
What happened: Something went wrong during setup.
How to fix it:
- Make sure you have Python installed (version 3.8 or newer)
- Try running the commands one at a time instead of all at once
- Check your internet connection - some downloads are quite large
Congratulations! You Did It!
You've successfully:
- Learned the difference between model formats
- Converted a PyTorch model to ONNX format
- Set up your own local AI assistant
- Got it running on your personal computer
Have questions or run into issues? The community is very helpful - don't hesitate to ask on the forums or in the Foundry Local repo.