Educator Developer Blog

Recipe Generator Application with Phi-3 Vision on AI Toolkit Locally

Jan 28, 2025

Images surround us in today's digital world, from social media to medical scans. But how do we unlock and use the information within them using Generative AI? This blog post will guide you through developing a Recipe Generator Application using the Phi-3 Vision small language model (SLM). It demonstrates how to do this free of charge and entirely in a local, on-premises environment using the VS Code AI Toolkit. Get ready to explore the possibilities of Generative AI!

In today's data-driven world, images have become a ubiquitous source of information. From social media feeds to medical imaging, we encounter and generate images constantly. Extracting meaningful insights from this visual data requires sophisticated analysis techniques. In this blog post, let’s build an image analysis application, a Recipe Generator, using the Phi-3 Vision model, free of cost and in an on-premises environment, using the VS Code AI Toolkit. We'll explore the possibilities that this combination offers.

The AI Toolkit for Visual Studio Code (VS Code) is a VS Code extension that simplifies generative AI app development by bringing together AI development tools and models. I recommend going through the following blogs to get started with the VS Code AI Toolkit:

1. Visual Studio Code AI Toolkit: How to Run LLMs locally

2. Visual Studio AI Toolkit : Building Phi-3 GenAI Applications

3. Building Retrieval Augmented Generation on VSCode & AI Toolkit

4. Bring your own models on AI Toolkit - using Ollama and API keys

Setup VS Code AI Toolkit:

Launch VS Code and click on the AI Toolkit extension. Log in to your GitHub account if you have not already done so. Once ready, click on the model catalog. The catalog lists many models, broadly classified into two categories:

Local Run (with CPU and with GPU)
Remote Access (Hosted by GitHub and other providers)

Visual Studio Code AI Toolkit: Model Catalog

For this blog, we will be using a Local Run model, which uses the local machine’s hardware to run the language model. Since the application analyzes images, we need a language model that supports vision operations. Phi-3 Vision is a good fit because it is lightweight and supports local runs. Download the model; we will then load it in the playground to test it.

Visual Studio Code AI Toolkit: Phi-3 Vision Model

Once downloaded, launch the “Playground” tab and load the Phi-3 Vision model from the dropdown. The Playground also shows that Phi-3 Vision allows image attachments, so we can try it out before we start developing the application.

Visual Studio Code AI Toolkit: Model Selection and Playground

Let’s upload an image using the paperclip icon in the UI. I uploaded an image of the Microsoft logo and prompted the language model to analyze and explain the image.

Visual Studio Code AI Toolkit: Playground completions

Running locally, Phi-3 Vision not only detected that the image was a logo but also identified the company and read its name correctly. This is a simple use case, but it can be built upon for many other applications.

Port Forwarding:

Port forwarding is the AI Toolkit feature that lets other applications communicate with the GenAI model. To set it up, open the terminal panel and navigate to the “Ports” section. Click the “Forward a Port” button and select any desired port; in this blog we will use 5272.

Visual Studio Code AI Toolkit: Port Forwarding

The model-as-a-server is now ready: the model is available on port 5272 to respond to API calls. It can be tested with any API testing application.
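As a rough illustration of what such an API test might look like without any extra tooling, the sketch below builds a plain HTTP request using only the Python standard library. It assumes the forwarded port exposes an OpenAI-compatible chat completions route under /v1 (the same route the OpenAI SDK client used later in this post would call) and that the model name matches the one loaded in the AI Toolkit. The helper names build_chat_request and send are hypothetical.

```python
import json
import urllib.request

# Assumption: the forwarded port serves an OpenAI-compatible REST API under /v1.
BASE_URL = "http://127.0.0.1:5272/v1"
# Assumption: this must match the name of the model loaded in the AI Toolkit.
MODEL_NAME = "Phi-3-vision-128k-cpu-int4-rtn-block-32-acc-level-4-onnx"


def build_chat_request(prompt: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for the chat completions route."""
    payload = {
        "model": MODEL_NAME,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 120.0) -> str:
    """Send the request and return the text of the first completion."""
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        body = json.load(reply)
    return body["choices"][0]["message"]["content"]


request = build_chat_request("Say hello in one sentence.")
print(request.full_url)

# To actually query the model, the AI Toolkit server must be running:
# print(send(request))
```

Building the request separately from sending it makes it easy to inspect the URL and payload before the server is involved.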

Creating Application with Python using OpenAI SDK:

To follow this section, Python must be installed on the local machine. Launch a new VS Code window and set the working directory. Create a new Python virtual environment. Once the setup is ready, open the terminal in VS Code and install the libraries using “pip”.

pip install openai
pip install streamlit

Before we build the Streamlit application, let's develop a basic program and check the responses in the VS Code terminal. We will then turn it into a basic web app using the Streamlit framework.

Basic Program

Import libraries:

import base64
from openai import OpenAI

base64: The base64 module provides functions for encoding binary data to base64-encoded strings and decoding base64-encoded strings back to binary data. Base64 encoding is commonly used for encoding binary data in text-based formats such as JSON or XML.

OpenAI: The OpenAI package is a Python client library for interacting with OpenAI's API. The OpenAI class provides methods for accessing various OpenAI services, such as generating text, performing natural language processing tasks, and more.

Initialize Client:

Initialize an instance of the OpenAI class from the openai package,

client = OpenAI(
    base_url="http://127.0.0.1:5272/v1/",
    api_key="xyz" # required by API but not used
)

OpenAI(): Creates a client object configured with two parameters, a base URL for the API and an API key. This instance will be used to send requests to the model. Because the base URL points to the local AI Toolkit server, the requests go there rather than to OpenAI's hosted service.

base_url = "http://127.0.0.1:5272/v1/": Specifies the base URL for the API. In this case, it points to a local server running on 127.0.0.1 (localhost) at port 5272.
api_key = "xyz": A placeholder API key. The client requires a value here, but the AI Toolkit's local server does not use it, so a dummy string is sufficient.

The image analysis application will frequently deal with images uploaded by users. But to send these images to GenAI model, we need them in a format it understands. This is where the encode_image function comes in.

# Function to encode the image
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

Function Definition:

def encode_image(image_path): defines a function named encode_image that takes a single argument, image_path. This argument represents the file path of the image we want to encode.

Opening the Image:

with open(image_path, "rb") as image_file: opens the image file specified by image_path in binary reading mode ("rb"). This is crucial because we're dealing with raw image data, not text.

Reading Image Content:

image_file.read() reads the entire content of the image file into a bytes object. These are the raw bytes of the file exactly as stored on disk (for a JPEG, the compressed image data), not decoded pixel values.

Base64 Encoding:

base64.b64encode(image_file.read()) encodes the byte stream containing the image data into base64 format. Base64 encoding is a way to represent binary data using a combination of printable characters, which makes it easier to transmit or store the data.

Decoding to UTF-8:

.decode("utf-8") converts the base64-encoded bytes returned by b64encode into a regular Python string. This step is necessary because the request is sent as JSON, which carries text rather than raw bytes. Base64 output consists only of ASCII characters, so the conversion is lossless.

Returning the Encoded Image:

return returns the base64-encoded string representation of the image. This encoded string is what we'll send to the AI model for analysis.

In essence, the encode_image function acts as a bridge, transforming an image file on your computer into a format that the AI model can understand and process.
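To make the two steps inside encode_image concrete, here is a minimal sketch that runs them on a few stand-in bytes instead of a real image file. The sample bytes are made up for illustration; only the standard base64 module is used.

```python
import base64

# Stand-in bytes for illustration; a real call would use image_file.read().
raw_bytes = b"\xff\xd8\xff\xe0 not a real JPEG, just sample bytes"

encoded_bytes = base64.b64encode(raw_bytes)    # still a bytes object
encoded_text = encoded_bytes.decode("utf-8")   # now a str, safe to embed in JSON

print(type(encoded_bytes).__name__, type(encoded_text).__name__)

# Base64 is reversible: decoding the text gives back the original bytes.
restored = base64.b64decode(encoded_text)
print(restored == raw_bytes)
```

Because the encoding is reversible, the server can recover the exact image file from the string it receives.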

Path for the Image:

For this section, we will use an image stored on our local machine. When we develop the web app, we will change this to accept whatever image the user uploads.

image_path = "C:/img.jpg" #path of the image here

This line of code is crucial for any program that needs to interact with an image file. It provides the necessary information for the program to locate and access the image data.

Base64 String:

# Getting the base64 string
base64_image = encode_image(image_path)

This line of code is responsible for obtaining the base64-encoded representation of the image specified by the image_path. Let's break it down:

encode_image(image_path): This part calls the encode_image function, which we've discussed earlier. This function takes the image_path as input and performs the following:

Reads the image file from the specified path.
Converts the image data into a base64-encoded string.
Returns the resulting base64-encoded string.

base64_image = ...: This part assigns the return value of the encode_image function to the variable base64_image.

This section effectively fetches the image from the given location and transforms it into a special format (base64) that can be easily handled and transmitted by the computer system. This base64-encoded string will be used subsequently to send the image data to the AI model for analysis.

Invoking the Language Model:

This code tells the AI model what to do with the image.

response = client.chat.completions.create(
    model="Phi-3-vision-128k-cpu-int4-rtn-block-32-acc-level-4-onnx",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What's in the Image?",
                },
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"},
                },
            ],
        }
    ],
)

response = client.chat.completions.create(...): This line sends instructions to the AI model we're using (represented by client). Here's a breakdown of what it's telling the model:

chat.completions.create: We're using a specific part of the OpenAI API designed for having a conversation-like interaction with the model.

The ... part: This represents additional details that define what we want the model to do, which we'll explore next.

Let's break down the details (...) sent to the model:

1) model="Phi-3-vision-128k-cpu-int4-rtn-block-32-acc-level-4-onnx": This tells the local server which model should handle the request. In our case, it's the Phi-3 Vision model loaded in the AI Toolkit.

2) messages: This defines what information we're providing to the model. Here, we're sending a single message with two attributes:

"role": "user": This specifies that the message comes from a user (us).

The content: This includes two parts:

"What's in the Image?": This is the prompt we're sending to the model about the image.

"image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}: This sends the actual image data encoded in base64 format (stored in base64_image).

In a nutshell, this code snippet acts like giving instructions to the AI model. We specify the model to use, tell it we have a question about an image, and then provide the image data itself.
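The request above hard-codes image/jpeg in the data URL. As a hedged sketch of how the same message structure might be assembled for other image types, the hypothetical helpers below guess the MIME type from the file name with the standard mimetypes module and fall back to image/jpeg when the guess fails. The function names and the sample file name are assumptions made for illustration.

```python
import base64
import mimetypes


def to_data_url(image_bytes: bytes, filename: str) -> str:
    """Return a data URL whose MIME type is guessed from the file extension."""
    mime_type, _ = mimetypes.guess_type(filename)
    if mime_type is None or not mime_type.startswith("image/"):
        mime_type = "image/jpeg"  # assumption: fall back to the type used in this post
    encoded = base64.b64encode(image_bytes).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"


def build_user_message(prompt: str, image_bytes: bytes, filename: str) -> dict:
    """Assemble one user message with a text part and an image part."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": to_data_url(image_bytes, filename)}},
        ],
    }


# Made-up bytes and file name, used only to show the resulting structure.
message = build_user_message("What's in the Image?", b"sample-bytes", "logo.png")
print(message["content"][1]["image_url"]["url"][:22])
```

The returned dictionary could be passed as the single element of the messages list in client.chat.completions.create.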

Printing the response on the console:

print(response.choices[0].message.content)

We asked the AI model "What's in the Image?" and this line of code displays the AI's answer.

Console response:

Visual Studio Code: Terminal Response

Finally, we can see the response on the terminal. Now to make things more interesting, let’s convert this into a webapp using the streamlit framework.

Recipe Generator Application with Streamlit:

Now we know how to interact with the Vision model offline using a basic console. Let’s make things even more exciting by applying all this to a use case that cooking enthusiasts will love. Yes, let’s create an application that assists in cooking by looking at what’s in an image of ingredients!

Create a new file and name it “app.py”, and select the same venv that was used earlier. Make sure the VS Code AI Toolkit is running and serving the Phi-3 Vision model through port 5272.

First step is importing the libraries,

import streamlit as st
import base64
from openai import OpenAI

base64 and OpenAI are the same as in the earlier section.

Streamlit: This part imports the entire Streamlit library, which provides a powerful set of tools for creating user interfaces (UIs) with Python. Streamlit simplifies the process of building web apps by allowing you to write Python scripts that directly translate into interactive web pages.

client = OpenAI(
  base_url="http://127.0.0.1:5272/v1/",
  api_key="xyz" # required by API but not used
)

As discussed in the earlier section, initializing the client and configuring the base_url and api_key.

st.title('Recipe Generator 🍔')
st.write('This is a simple recipe generator application. Upload images of the ingredients and get the recipe by Chef GenAI! 🧑‍🍳')
uploaded_file = st.file_uploader("Choose a file")
if uploaded_file is not None:
  st.image(uploaded_file, width=300)

st.title('Recipe Generator 🍔'): This line sets the title of the Streamlit application as "Recipe Generator" with a visually appealing burger emoji.
st.write(...): This line displays a brief description of the application's functionality to the user.
uploaded_file = st.file_uploader("Choose a file"): This creates a file uploader component within the Streamlit app. Users can select and upload an image file (likely an image of ingredients).
if uploaded_file is not None: : This conditional block executes only when the user has actually selected and uploaded a file.
st.image(uploaded_file, width=300): If an image is uploaded, this line displays the uploaded image within the Streamlit app with a width of 300 pixels.

In essence, this code establishes the basic user interface for the Recipe Generator app. It allows users to upload an image, and if an image is uploaded, it displays the image within the app.

preference = st.sidebar.selectbox(
    "Choose your preference",
    ("Vegetarian", "Non-Vegetarian")
)

cuisine = st.sidebar.selectbox(
  "Select for Cuisine",
  ("Indian","Chinese","French","Thai","Italian","Mexican","Japanese","American","Greek","Spanish")
)

We use Streamlit's sidebar and selectbox features to create interactive user input options within a web application:

st.sidebar.selectbox(...): This line creates a dropdown menu (selectbox) within the sidebar of the Streamlit application. The first argument, "Choose your preference", sets the label for the dropdown. The second argument, ("Vegetarian", "Non-Vegetarian"), defines the options available for the user to select (in this case, dietary preferences).
cuisine = st.sidebar.selectbox(...): This line creates another dropdown menu in the sidebar, this time for selecting the desired cuisine. The label is "Select for Cuisine". The options include "Indian", "Chinese", "French", and several other popular cuisines.

In essence, this code allows users to interact with the application by selecting their dietary preference (Vegetarian or Non-Vegetarian) and desired cuisine from the dropdown menus in the sidebar.
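The two selections end up inside the prompt that accompanies the image later in the app. A minimal sketch of that step as a standalone function might look like the following. The name build_prompt is hypothetical, and the wording is the prompt used in this post.

```python
def build_prompt(preference: str, cuisine: str) -> str:
    """Combine the two sidebar selections into the instruction sent with the image."""
    return (
        "STRICTLY use the ingredients in the image to generate "
        f"a {preference} recipe and {cuisine} cuisine."
    )


# Example values as they would come back from the two selectboxes.
print(build_prompt("Vegetarian", "Thai"))
```

Keeping the prompt in one small function makes it easier to adjust the wording without touching the request code.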

def encode_image(uploaded_file):
  """Encodes a Streamlit uploaded file into base64 format"""
  if uploaded_file is not None:
    content = uploaded_file.read()
    return base64.b64encode(content).decode("utf-8")
  else:
    return None

base64_image = encode_image(uploaded_file)

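This is the encode_image function from the earlier section, adapted to read from the file object returned by the uploader instead of a path on disk. It returns None when nothing has been uploaded.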
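One detail worth keeping in mind is that read() on a file-like object consumes it, so a second read() returns nothing unless the position is reset. The sketch below demonstrates this with io.BytesIO as a stand-in, on the assumption that Streamlit's uploaded file object behaves like an in-memory binary buffer. The sample bytes are made up.

```python
import base64
import io

# Stand-in for the uploaded file. Assumption: the object returned by
# st.file_uploader behaves like io.BytesIO.
fake_upload = io.BytesIO(b"sample image bytes")

first_read = fake_upload.read()        # consumes the buffer; position moves to the end
second_read = fake_upload.read()       # nothing left to read
whole_buffer = fake_upload.getvalue()  # returns everything, ignoring the position

print(len(first_read), len(second_read), len(whole_buffer))

# Rewind with seek(0) (or use getvalue()) before encoding a second time.
fake_upload.seek(0)
encoded = base64.b64encode(fake_upload.read()).decode("utf-8")
```

If the encoded image ever comes back empty, checking whether the file was already read elsewhere is a reasonable first step.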

if st.button("Ask Chef GenAI!"):
  if base64_image:
    # Send the prompt and the uploaded image to the local Phi-3 Vision model
    response = client.chat.completions.create(
      model="Phi-3-vision-128k-cpu-int4-rtn-block-32-acc-level-4-onnx",
      messages=[
        {
          "role": "user",
          "content": [
            {
              "type": "text",
              "text": f"STRICTLY use the ingredients in the image to generate a {preference} recipe and {cuisine} cuisine.",
            },
            {
              "type": "image_url",
              "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"},
            },
          ],
        }
      ],
    )
    print(response.choices[0].message.content)  # log the recipe to the terminal
    st.write(response.choices[0].message.content)  # show the recipe in the app
  else:
    # The button was clicked before any image was uploaded
    st.write("Please upload an image with any number of ingredients and instantly get a recipe.")

The above code block implements the core functionality of the Recipe Generator app, triggered when the user clicks the button labeled "Ask Chef GenAI!":

if st.button("Ask Chef GenAI!"): This line checks if the user has clicked the button. If they have, the code within the if block executes.

if base64_image: This inner if condition checks whether the variable base64_image has a value. This variable stores the base64-encoded representation of the uploaded image (containing ingredients), or None if nothing has been uploaded. If base64_image has a value, the code proceeds.

client.chat.completions.create(...): The client defined earlier interacts with the API. Here, it calls the chat completions endpoint to generate text, thereby invoking the small language model. The arguments specify the model to be used ("Phi-3-vision-128k-cpu-int4-rtn-block-32-acc-level-4-onnx") and the messages to be completed.

The message consists of two parts within a list:

User Input: The first part defines the user's role ("user") and the content they provide. This content is an instruction with two key points:

Dietary Preference: It specifies to "STRICTLY use the ingredients in the image" to generate a recipe that adheres to the user's preference (vegetarian or non-vegetarian, set using the preference dropdown).
Cuisine Preference: It mentions the desired cuisine type (Indian, Chinese, etc., selected using the cuisine dropdown).

Image Data: The second part provides the image data itself. It includes the type ("image_url") and the URL, which is constructed using the base64_image variable containing the base64 encoded image data.

print(response.choices[0].message.content) & st.write(...): The response will contain a list of possible completions. Here, the code retrieves the first completion (response.choices[0]) and extracts its message content. This content is then printed to the console like before and displayed on the Streamlit app using st.write.

else block: If no image is uploaded (i.e., base64_image is empty), the else block executes. It displays a message reminding the user to upload an image to get recipe recommendations.

The above code block is the same as before, except that we have modified it to accept a few user inputs and made it compatible with Streamlit.
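If the AI Toolkit server is stopped or the port is not forwarded, the model call raises an exception and the app shows a traceback. One possible way to soften that, sketched here with a hypothetical ask_model wrapper and a simulated failure, is to run the call inside a try block and return a readable message instead. Catching Exception broadly is a simplification for illustration.

```python
from typing import Callable, Tuple


def ask_model(call_model: Callable[[], str]) -> Tuple[bool, str]:
    """Run the model call and return (ok, text) instead of raising."""
    try:
        return True, call_model()
    except Exception as error:  # broad on purpose for this sketch; narrow it for real use
        return False, f"Could not reach the local model: {error}"


def unreachable_server() -> str:
    # Hypothetical failure standing in for a stopped AI Toolkit server.
    raise ConnectionError("connection refused on port 5272")


ok, text = ask_model(unreachable_server)
print(ok, text)
```

In the app, the callable would wrap the client.chat.completions.create(...) call and return its message content, and the result could be shown with st.write on success or st.error on failure.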

The code for our Streamlit application is now complete, and it's time to test it. Navigate to the terminal in VS Code and enter the following command (assuming the file is named app.py):

streamlit run app.py

Upon a successful run, the default browser will open and display the Recipe Generator screen:

Recipe Generator

Upload an image with ingredients, select the preference and cuisine, and click on “Ask Chef GenAI!”. It will take a few moments for the recipe to be generated.

While it is generating, we can see the logs in the terminal, and finally the recipe will be shown on the screen!

Recipe Generator Response

Enjoy your first recipe curated by Chef GenAI, powered by the Phi-3 Vision model running locally using the VS Code AI Toolkit! The code is available on the following GitHub Repository.

In upcoming posts in this series, we will explore more types of GenAI implementations with the AI Toolkit.