What is Foundry Local?
Foundry Local is a user-friendly tool that helps you run small AI language models directly on your Windows or Mac computer. Think of it as a way to have your own personal ChatGPT running locally on your machine, without needing an internet connection or sending your data to external servers.
Currently, Foundry Local works great with several popular model families:
- Phi models (Microsoft's small but powerful models)
- Qwen models (Alibaba's multilingual models)
- DeepSeek models (efficient reasoning models)
In this tutorial, we'll learn how to set up the Qwen3-0.6B model step by step. Don't worry if you're new to AI - we'll explain everything along the way!
Why Do We Need to Convert AI Models?
When you download AI models from websites like Hugging Face (think of it as GitHub for AI models), they usually come as PyTorch weights. While PyTorch is great for training models, it isn't the most efficient format for running them on your personal computer.
To make these models work efficiently on your laptop or desktop, we need to:
- Convert the format - Change it to something your computer can run faster
- Make it smaller - Compress the model so it uses less memory and storage
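To get a feel for why compression matters, here's a rough back-of-the-envelope size estimate for a 0.6B-parameter model at different precisions. This is illustrative arithmetic only - real model files also include metadata and some tensors kept at higher precision:
# Approximate size of the weights alone for a 0.6B-parameter model
params = 0.6e9
print(f"FP16 (2 bytes/param):   ~{params * 2 / 1e9:.1f} GB")    # roughly 1.2 GB
print(f"INT8 (1 byte/param):    ~{params * 1 / 1e9:.1f} GB")    # roughly 0.6 GB
print(f"INT4 (0.5 bytes/param): ~{params * 0.5 / 1e9:.1f} GB")  # roughly 0.3 GB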
We'll use two main formats for this conversion:
GGUF vs ONNX: Which Format Should You Choose?
Think of these as different "languages" that your computer can understand. Since we're working with small language models (like Qwen3-0.6B), let's see which format works best for your needs:
GGUF (GPT-Generated Unified Format)
Best for: Basic computers, simple setups, or if you want maximum simplicity
Advantages:
- Super memory-efficient - Uses much less RAM through smart compression
- One file, that's it - Everything you need is in a single file (no complicated folders)
- CPU-friendly - Works great on older computers or laptops without powerful graphics cards
- Simple tools - Easy to use with popular tools like llama.cpp
- Quick to set up - Less configuration needed
Disadvantages:
- Size matters - For small models, the format adds relatively more "overhead"
- Limited flexibility - Only works with certain types of AI models (transformer-based)
ONNX (Open Neural Network Exchange)
Best for: Modern computers, when you want the best performance, or professional use
Advantages:
- Works with everything - Compatible with many different AI model types and architectures
- Hardware acceleration - Can use your graphics card (GPU) or special AI chips (NPU) for much faster performance
- Professional-grade - Used by companies in production environments
- Flexible conversion - Easy to convert from almost any AI training framework
- Mobile-ready - Great support for running on phones and tablets
- Smart optimization - ONNX Runtime automatically makes your model run faster
Disadvantages:
- More complex - Multiple files and folders to manage
- Larger file sizes - Takes up more storage space
- More setup - Requires a bit more configuration to get running
Our Recommendation for Beginners
For this tutorial, we'll use ONNX because:
- It gives you the best performance on most modern computers
- You can upgrade to GPU acceleration later if you want
- It's the industry standard that you'll encounter in most AI projects
- Foundry Local works excellently with ONNX models
Meet Microsoft Olive: Your Model Conversion Helper
Microsoft Olive is like a smart assistant that helps convert AI models for you. Instead of doing all the technical work manually, Olive automates the process and makes sure everything works correctly.
Here's what makes Olive special:
- Works with any computer setup - Whether you have a basic laptop or a gaming PC with a powerful graphics card
- Does the work for you - No need to learn complex conversion commands
- Multiple compression options - Can make your model smaller in different ways (INT4, INT8, FP16 - don't worry about these terms for now!)
- Plays well with others - Works seamlessly with other AI tools you might use
Let's Convert Your Model: Step-by-Step Guide
Don't worry if you've never done this before - we'll go through each step carefully!
Step 1: Install the Tools We Need
First, we need to install some software tools. Think of this like downloading apps on your phone - each tool has a specific job to help us convert the model.
Open your terminal (Command Prompt on Windows, Terminal on Mac) and run these commands one by one:
# This updates the main AI library to the latest version
pip install transformers -U
# This installs Microsoft Olive (our conversion helper)
pip install git+https://github.com/microsoft/Olive.git
# This downloads, builds, and installs ONNX Runtime GenAI (used to build and run the converted model)
git clone https://github.com/microsoft/onnxruntime-genai
cd onnxruntime-genai && python build.py --config Release
pip install {Your build release path}/onnxruntime_genai-0.9.0.dev0-cp311-cp311-linux_x86_64.whl
Important note: You'll also need CMake version 3.31 or newer. If you don't have it, you can download it from cmake.org.
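Once the installs finish, a quick way to confirm everything is importable is to run a few lines of Python. The module names below are the standard ones; if your wheel was built differently, adjust accordingly:
# Quick sanity check that the key packages can be imported
import transformers
import olive
import onnxruntime_genai as og

print("transformers", transformers.__version__)
print("onnxruntime-genai imported OK")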
Step 2: The Easy Way - One Command Conversion
Once everything is installed, converting your model is surprisingly simple! Just run this command (but replace {Your Qwen3-0.6B Path} with the actual location where you downloaded the model):
olive auto-opt \
--model_name_or_path {Your Qwen3-0.6B Path} \
--device cpu \
--provider CPUExecutionProvider \
--use_model_builder \
--precision int4 \
--output_path models/Qwen3-0.6B/onnx \
--log_level 1
What does this command do?
- --device cpu means we're optimizing for your computer's processor
- --precision int4 makes the model smaller (about 75% size reduction!)
- --output_path tells Olive where to save the converted model
Tip: If you have a powerful graphics card, you can change cpu to cuda for potentially better performance.
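If you want to sanity-check the converted model before moving on, you can load it with the onnxruntime-genai Python package. This is a minimal sketch - the exact folder name inside your output_path and the generation API can differ slightly between onnxruntime-genai versions, so treat it as a starting point rather than the definitive way:
# Minimal generation loop with onnxruntime-genai (API style for recent versions)
import onnxruntime_genai as og

model = og.Model("models/Qwen3-0.6B/onnx/model")   # assumed folder containing genai_config.json
tokenizer = og.Tokenizer(model)

params = og.GeneratorParams(model)
params.set_search_options(max_length=128)

generator = og.Generator(model, params)
generator.append_tokens(tokenizer.encode("<|im_start|>user\nWhat is ONNX?<|im_end|>\n<|im_start|>assistant\n"))

while not generator.is_done():
    generator.generate_next_token()

print(tokenizer.decode(generator.get_sequence(0)))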
Step 3: The Advanced Way - Using a Configuration File
For those who want more control, you can create a configuration file. This is like creating a recipe that tells Olive exactly how you want your model converted.
Create a new file called conversion_config.json and add this content:
{
"input_model": {
"type": "HfModel",
"model_path": "Qwen/Qwen3-0.6B",
"task": "text-generation"
},
"systems": {
"local_system": {
"type": "LocalSystem",
"accelerators": [
{
"execution_providers": [
"CPUExecutionProvider"
]
}
]
}
},
"passes": {
"builder": {
"type": "ModelBuilder",
"config": {
"precision": "int4"
}
}
},
"host": "local_system",
"target": "local_system",
"cache_dir": "cache",
"output_dir": "model/output/Qwen3-0.6B-ONNX"
}
Then run this command:
olive run --config ./conversion_config.json
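If you'd rather stay in Python, Olive also exposes a programmatic entry point that takes the same configuration file. A small sketch, assuming the olive.workflows.run helper that the Olive project documents:
# Run the same conversion workflow from Python instead of the CLI
from olive.workflows import run as olive_run

olive_run("./conversion_config.json")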
Before you start: If this is your first time downloading models from Hugging Face, you'll need to log in first:
huggingface-cli login
This will ask for your Hugging Face token (you can get one free from their website).
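If you prefer, the same login can be done from Python with the huggingface_hub library. The token string below is just a placeholder - use your own token from your account settings:
# Log in to Hugging Face programmatically
from huggingface_hub import login

login(token="hf_your_token_here")  # placeholder - replace with your own token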
Setting Up Your Converted Model in Foundry Local
Great! Now that you have a converted model, let's get it running in Foundry Local. This is like installing a new app on your computer.
What You'll Need
- Foundry Local installed on your computer
- Your freshly converted ONNX model from the previous steps
- A few minutes to set everything up
Getting Started
First, let's navigate to where Foundry Local stores its models:
foundry cache cd ./models/
This command tells Foundry Local which folder to use as its model cache - think of it as the app store for your AI models.
Step 1: Create a Chat Template
AI models need to know how to format conversations. It's like teaching them the "grammar" of chatting. Create a new file called inference_model.json with this content:
{
"Name": "Qwen3-0.6b-cpu",
"PromptTemplate": {
"system": "<|im_start|>system\n{Content}<|im_end|>",
"user": "<|im_start|>user\n/think{Content}<|im_end|>",
"assistant": "<|im_start|>assistant\n{Content}<|im_end|>",
"prompt": "<|im_start|>user\n/think{Content}<|im_end|>\n<|im_start|>assistant"
}
}
What's this "think" thing? Qwen models have a special feature where they can "think out loud" before giving you an answer. It's like showing their work in math class! This often leads to better, more thoughtful responses. If you don't want this feature, just remove /think from the templates above.
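To make the template concrete, here's a tiny illustration of the substitution Foundry Local performs for you - {Content} is simply replaced with whatever you type:
# Illustration only: how the "prompt" template becomes the raw string the model sees
template = "<|im_start|>user\n/think{Content}<|im_end|>\n<|im_start|>assistant"
print(template.replace("{Content}", "What is the capital of France?"))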
Step 2: Organize Your Files
Create a neat folder structure for your model. This helps Foundry Local find everything easily:
# Create a folder for your model
mkdir -p ./models/qwen/Qwen3-0.6B
# Copy your converted files here
# (You'll need to move your ONNX files and the inference_model.json to this folder)
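If you'd rather script the copy step, here's a small sketch. The source path is an assumption based on the output_path used earlier, so adjust it to wherever Olive actually wrote your files:
# Copy the converted ONNX files and the chat template into the Foundry cache layout
import shutil
from pathlib import Path

src = Path("models/Qwen3-0.6B/onnx/model")   # assumed Olive output folder from Step 2
dst = Path("./models/qwen/Qwen3-0.6B")       # Foundry Local model folder created above
dst.mkdir(parents=True, exist_ok=True)

for f in src.iterdir():
    shutil.copy2(f, dst / f.name)
shutil.copy2("inference_model.json", dst / "inference_model.json")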
Why this structure?
- qwen = the publisher namespace (it matches the organization name on Hugging Face)
- Qwen3-0.6B = the specific model name
Step 3: Check If Everything Worked
Let's verify that Foundry Local can see your new model:
foundry cache ls
You should see Qwen3-0.6b-cpu in the list. If you don't see it, double-check that your files are in the right place.
Step 4: Take It for a Test Drive!
The moment of truth - let's start chatting with your model:
foundry model run Qwen3-0.6b-cpu
If everything worked correctly, you should see your model starting up, and you can begin asking it questions!
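Beyond the interactive chat, Foundry Local also exposes an OpenAI-compatible endpoint, so you can call your model from code. The base URL below is an assumption - check the address and port that foundry service status reports on your machine:
# Chat with the local model through Foundry Local's OpenAI-compatible API
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:5273/v1",  # assumed local endpoint; check `foundry service status`
    api_key="not-needed-for-local",       # Foundry Local doesn't require a real key
)

response = client.chat.completions.create(
    model="Qwen3-0.6b-cpu",
    messages=[{"role": "user", "content": "Explain ONNX in one sentence."}],
)
print(response.choices[0].message.content)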
Troubleshooting: When Things Don't Go as Planned
Don't worry if you run into issues - this is totally normal! Here are the most common problems and how to fix them:
Problem: "Model not found" error
What happened: Foundry Local can't find your model files.
How to fix it:
- Double-check that your files are in the right folder: ./models/qwen/Qwen3-0.6B/
- Make sure the inference_model.json file is in the same folder as your ONNX files
- Check that the model name in the JSON file matches what you're trying to run
Problem: Model starts but gives weird responses
What happened: The chat template might not be set up correctly.
How to fix it:
- Check your inference_model.json file for typos
- Make sure the special characters (<|im_start|>, <|im_end|>) are exactly as shown
- Try removing the /think part if you're getting strange outputs
Problem: Model runs very slowly
What happened: Your computer might be working harder than it needs to.
How to fix it:
- Close other programs to free up memory
- If you have a good graphics card, try the GPU version instead of CPU
- Consider using a smaller model if performance is still poor
Problem: Installation commands fail
What happened: Something went wrong during setup.
How to fix it:
- Make sure you have Python installed (version 3.8 or newer)
- Try running the commands one at a time instead of all at once
- Check your internet connection - some downloads are quite large
Congratulations! You Did It!
You've successfully:
- Learned the difference between model formats
- Converted a PyTorch model to ONNX format
- Set up your own local AI assistant
- Got it running on your personal computer
Have questions or run into issues? The community is very helpful - don't hesitate to ask on the forums or in the Foundry Local repo.