Apps on Azure Blog

Build an AI Image-Caption Generator on Azure App Service with Streamlit and GPT-4o-mini

TulikaC
Sep 02, 2025

This tiny app does exactly one thing: upload an image → get a natural one-line caption. Under the hood:

  • Azure AI Vision extracts high-confidence tags from the image.
  • Azure OpenAI (GPT-4o-mini) turns those tags into a fluent caption.
  • Streamlit provides a lightweight, Python-native UI so you can ship fast.

All code + infra templates: image_caption_app in the App Service AI Samples repo: https://github.com/Azure-Samples/appservice-ai-samples/tree/main/image_caption_app

What are these components?

  • What is Streamlit? An open-source Python framework to build interactive data/AI apps with just a few lines of code—perfect for quick, clean UIs.
  • What is Azure AI Vision (Vision API)? A cloud service that analyzes images and returns rich signals like tags with confidence scores, which we use as grounded inputs for captioning.

How it works (at a glance)

  1. User uploads a photo in Streamlit.
  2. The app calls Azure AI Vision → gets a list of tags (keeps only high-confidence ones).
  3. The app sends those tags to GPT-4o-mini → generates a one-line caption.
  4. The caption is shown instantly in the browser (the whole flow is sketched below).
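
In Streamlit terms, that flow fits in a few lines. A minimal sketch, assuming the repo's helpers extract_tags and generate_caption from the code tour below (the actual app.py differs in detail):

import streamlit as st
from utils.vision import extract_tags
from utils.openai_caption import generate_caption

st.title("AI Image Caption Generator")
uploaded = st.file_uploader("Upload an image", type=["jpg", "jpeg", "png"])
if uploaded:
    image_bytes = uploaded.read()          # raw bytes for the Vision call
    st.image(image_bytes)
    tags = extract_tags(image_bytes)       # Azure AI Vision: high-confidence tags
    caption = generate_caption(tags)       # GPT-4o-mini: one-line caption
    st.success(caption)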

Prerequisites

Resources you’ll deploy

You can create everything manually or with the provided azd template.

What you need

  • Azure App Service (Linux) to host the Streamlit app.
  • Azure AI Foundry/OpenAI with a gpt-4o-mini deployment for caption generation.
  • Azure AI Vision (Computer Vision) for image tagging.
  • Managed Identity enabled on the Web App, with RBAC grants so the app can call Vision and OpenAI without secrets.

One-command deploy with azd (recommended)
The sample includes infra under image_caption_app/infra so azd up can provision + deploy in one go.

# 1) Clone and move into the sample 
git clone https://github.com/Azure-Samples/appservice-ai-samples 
cd appservice-ai-samples/image_caption_app 

# 2) Log in and provision + deploy 
azd auth login 
azd up

Manual path (if you prefer doing it yourself)

  1. Create Azure AI Vision, note the endpoint (custom subdomain).
  2. Create Azure AI Foundry/OpenAI and deploy gpt-4o-mini.
  3. Create App Service (Linux, Python) and enable System-Assigned Managed Identity.
  4. Assign roles to the Web App’s Managed Identity (a CLI sketch follows this list):
    • Cognitive Services OpenAI User on your OpenAI resource.
    • Cognitive Services User on your Vision resource.
  5. Add app settings for endpoints and deployment names (see repo), deploy the code, and run.
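
For step 4, the role assignments can also be scripted. A hedged sketch with placeholder names (your resource names and scopes will differ):

# Principal ID of the web app's system-assigned identity
PRINCIPAL_ID=$(az webapp identity show \
  --name <your-webapp-name> --resource-group <your-rg> \
  --query principalId -o tsv)

# Allow the app to call Azure OpenAI
az role assignment create \
  --assignee "$PRINCIPAL_ID" \
  --role "Cognitive Services OpenAI User" \
  --scope "$(az cognitiveservices account show \
    --name <your-openai-resource> --resource-group <your-rg> --query id -o tsv)"

# Allow the app to call Azure AI Vision
az role assignment create \
  --assignee "$PRINCIPAL_ID" \
  --role "Cognitive Services User" \
  --scope "$(az cognitiveservices account show \
    --name <your-vision-resource> --resource-group <your-rg> --query id -o tsv)"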

Startup command (manual setting):
If you’re configuring the Web App yourself (instead of using the Bicep template), set the Startup Command to:

streamlit run app.py --server.port 8000 --server.address 0.0.0.0

Portal path: App Service → Configuration → General settings → Startup Command.
CLI example:

az webapp config set \
  --name <your-webapp-name> \
  --resource-group <your-rg> \
  --startup-file "streamlit run app.py --server.port 8000 --server.address 0.0.0.0"

(The provided Bicep template already sets this for you.)

Code tour (the important bits)

Top-level flow (app.py)
First we get tags from Vision, then ask GPT-4o-mini for a one-liner:

tags = extract_tags(image_bytes)
caption = generate_caption(tags)

Vision call (utils/vision.py)
Call the Vision REST API, parse JSON, and keep high-confidence tags (> 0.6):

import requests

# POST the raw image bytes to the Vision analyze endpoint
response = requests.post(
    VISION_API_URL,
    headers=headers,
    params=PARAMS,
    data=image_bytes,
    timeout=30,
)
response.raise_for_status()
analysis = response.json()

# Keep only tags the service is confident about
tags = [
    t.get('name')
    for t in analysis.get('tags', [])
    if t.get('name') and t.get('confidence', 0) > 0.6
]
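
For context, the pieces the excerpt assumes (VISION_API_URL, PARAMS, headers) might be assembled roughly like this. A sketch using the Image Analysis v3.2 REST endpoint and managed identity auth; the env var name is a placeholder, and the repo's actual construction may differ:

import os
from azure.identity import DefaultAzureCredential

VISION_ENDPOINT = os.environ["VISION_ENDPOINT"]  # e.g. https://<resource>.cognitiveservices.azure.com
VISION_API_URL = f"{VISION_ENDPOINT}/vision/v3.2/analyze"
PARAMS = {"visualFeatures": "Tags"}

# Exchange the managed identity (or local dev credential) for a bearer token
token = DefaultAzureCredential().get_token("https://cognitiveservices.azure.com/.default")
headers = {
    "Authorization": f"Bearer {token.token}",
    "Content-Type": "application/octet-stream",
}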

Caption generation (utils/openai_caption.py)
Join the tags and ask GPT-4o-mini for a natural caption (client is an Azure OpenAI client; a construction sketch appears in the Security & auth section below):

tag_text = ", ".join(tags)
prompt = f"""
You are an assistant that generates vivid, natural-sounding captions for images.
Create a one-line caption for an image that contains the following: {tag_text}.
"""

response = client.chat.completions.create(
    model=DEPLOYMENT_NAME,
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": prompt.strip()}
    ],
    max_tokens=60,
    temperature=0.7
)
return response.choices[0].message.content.strip()

Security & auth: Managed Identity by default (recommended)

This sample is wired to use Managed Identity on App Service—no keys in config.

  • The Web App’s Managed Identity authenticates to Vision and Azure OpenAI via Microsoft Entra ID.
  • Prefer Managed Identity in production; if you need to test locally, you can switch to key-based auth by supplying the service keys in your environment (see the sketch below).
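
As one illustration, the client from the code tour could be constructed with either auth mode. A sketch assuming the openai and azure-identity packages, with hypothetical env var names (check the repo's README for the real settings):

import os
from azure.identity import DefaultAzureCredential, get_bearer_token_provider
from openai import AzureOpenAI

endpoint = os.environ["AZURE_OPENAI_ENDPOINT"]   # hypothetical names for illustration
api_key = os.environ.get("AZURE_OPENAI_KEY")     # set only for local key-based testing

if api_key:
    # Local testing: key-based auth
    client = AzureOpenAI(azure_endpoint=endpoint, api_key=api_key, api_version="2024-06-01")
else:
    # Production: managed identity via Entra ID, no secrets in config
    token_provider = get_bearer_token_provider(
        DefaultAzureCredential(), "https://cognitiveservices.azure.com/.default"
    )
    client = AzureOpenAI(
        azure_endpoint=endpoint,
        azure_ad_token_provider=token_provider,
        api_version="2024-06-01",
    )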

Run it locally (optional)

# From the sample folder
python -m venv .venv && source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install -r requirements.txt

# Set env vars for endpoints + deployment (and keys if not using MI locally)
streamlit run app.py
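
For example (hypothetical variable names; check the repo's README for the exact settings):

export AZURE_OPENAI_ENDPOINT=https://<your-openai-resource>.openai.azure.com
export AZURE_OPENAI_DEPLOYMENT=gpt-4o-mini
export VISION_ENDPOINT=https://<your-vision-resource>.cognitiveservices.azure.com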

Repo map

  • App + Streamlit UI + helpers: image_caption_app/
  • Bicep infrastructure (used by azd up): image_caption_app/infra/

What’s next — ways to extend this sample

  • Richer vision signals: Add object detection, OCR, or brand detection; blend those into the prompt for sharper captions.
  • Persistence & gallery: Save images to Blob Storage and captions/metadata to Cosmos DB or SQLite; add a Streamlit gallery.
  • Performance & cost: Cache tags by image hash (see the sketch after this list); cap image size; track tokens/latency.
  • Observability: Wire up Application Insights with custom events (e.g., caption_generated).
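
For the caching idea, Streamlit's built-in cache already keys on function arguments (it hashes the bytes for you), so a thin wrapper is enough. A minimal sketch around the repo's extract_tags:

import streamlit as st
from utils.vision import extract_tags

@st.cache_data(show_spinner=False)
def extract_tags_cached(image_bytes: bytes) -> list[str]:
    # st.cache_data hashes image_bytes, so repeat uploads skip the Vision call
    return extract_tags(image_bytes)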

Looking for more Python samples? Check out the repo: https://github.com/Azure-Samples/appservice-ai-samples/tree/main

For more Azure App Service AI samples and best practices, see the Azure App Service AI integration documentation.
