azure ai

302 Topics

Set Up Plaud Note Pro with Microsoft Foundry
Prerequisites

- Riffado, up and running: follow the setup guide in the official Riffado repository to get it going with Docker Compose.
- A Microsoft Foundry (formerly Azure AI Foundry) resource, with the models you want deployed; in my case, whisper for transcription and o3-mini for summaries.
- A Plaud device, or any audio recordings you can import into Riffado.

Once Riffado is up, head to the Settings page > Providers > Add Provider, and select Custom. This is where the Azure details will go.

Why "OpenAI-compatible" isn’t one thing on Microsoft Foundry

Azure AI Foundry exposes two different API surfaces on the same resource, and which one serves your model depends on the model:

Surface | Path shape | Serves | OpenAI-compatible?
v1 route | /openai/v1/… | gpt-4o-transcribe, gpt-4o-mini-transcribe, chat models, embeddings | Yes: Bearer auth, model in the body, no api-version needed
Classic route | /openai/deployments/{name}/… | Whisper (and other legacy audio) | No: deployment name lives in the URL, and ?api-version= is mandatory

A generic OpenAI client (Riffado's included) can only speak the first dialect. It has nowhere to put a deployment name in the path and no way to append a query parameter. That single fact drives everything below.

Part 1 - Transcription
Whisper and the DeploymentNotFound mystery

Symptom

My very first transcription attempt in Riffado failed with 404 Resource not found. Off to a flying start. Configured provider: base URL https://<resource>.services.ai.azure.com, model whisper.

Dead end #1: the missing path

The first bug was mine: the base URL had no path. Riffado's OpenAI client appends /audio/transcriptions to whatever you give it, so requests were hitting https://<resource>…/audio/transcriptions, a path that doesn't exist on the resource at all.
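To make that failure concrete, here is a minimal Python sketch of how a generic OpenAI-style client builds its request URL. The `join_endpoint` helper and the `myresource` host are hypothetical stand-ins (Riffado's actual client code is not shown in this post); the only behaviour assumed is the one described above, namely that the client appends a fixed endpoint path to whatever base URL it is given.

```python
def join_endpoint(base_url: str, endpoint: str = "/audio/transcriptions") -> str:
    """Append a fixed endpoint path to the configured base URL (hypothetical helper)."""
    return base_url.rstrip("/") + endpoint


RESOURCE = "https://myresource.services.ai.azure.com"  # placeholder resource name

# Base URL with no path: the request targets a route the resource does not serve.
print(join_endpoint(RESOURCE))

# Base URL that carries the v1 prefix: the request targets the v1 route.
print(join_endpoint(RESOURCE + "/openai/v1"))
```

The client never inspects the base URL, so whatever prefix the API expects has to be part of the value typed into the provider settings.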
Fixing the base URL to end in /openai/v1 got us to a more interesting error:

    POST /openai/v1/audio/transcriptions · model=whisper
    {"error":{"code":"DeploymentNotFound","message":"The API deployment for this resource does not exist. If you created the deployment within the last 5 minutes, please wait a moment and try again."}}

Dead end #2: catalog ≠ deployment

Worth checking before anything else: selecting a model in the Foundry catalog is not deploying it. GET /openai/v1/models lists everything you could deploy; only Deployments → Deploy model creates an endpoint that answers. If you get DeploymentNotFound, first confirm a deployment actually exists (the listing below requires only the API key):

enumerate real deployments (classic control-plane, key auth)

    curl -s -H "api-key: $KEY" \
      "https://<resource>.openai.azure.com/openai/deployments?api-version=2023-03-15-preview"
    # → {"data":[{"id":"whisper","model":"whisper","status":"succeeded",…}]}

The actual cause

Here is the part that nearly drove me mad: the deployment existed and was succeeded, yet the v1 route still said DeploymentNotFound. Because Whisper deployments are not served on the v1 route at all. They only answer on the classic path. Verified side by side with the same tiny WAV file:

Request | Result
POST /openai/v1/audio/transcriptions · model=whisper · Bearer | 404 DeploymentNotFound
POST /openai/deployments/whisper/audio/transcriptions?api-version=2024-06-01 · Bearer | 200 {"text":"you"}
Same classic path, without ?api-version= | 404 Resource not found

Three constraints, then: Whisper needs the classic path; the classic path needs api-version; Riffado can send neither. One piece of good news hiding in the table: the classic route accepts Authorization: Bearer, not just Azure's api-key header, so the shim doesn't have to touch auth at all.

The fix: a Caddy shim

Drop a stock caddy:2-alpine container into the Compose network.
Riffado points at it as if it were OpenAI; the shim rewrites the path, injects api-version, and proxies to Azure. The Bearer header passes through untouched.

azure-shim.Caddyfile

    {
        admin off
        auto_https off
    }

    :80 {
        @transcribe path /v1/audio/transcriptions /audio/transcriptions
        handle @transcribe {
            rewrite * /openai/deployments/whisper/audio/transcriptions?api-version=2024-06-01
            reverse_proxy https://<resource>.services.ai.azure.com {
                header_up Host <resource>.services.ai.azure.com
            }
        }
        handle {
            respond "azure-shim ok" 200
        }
    }

docker-compose.yml (added service)

    azure-shim:
      image: caddy:2-alpine
      restart: unless-stopped
      volumes:
        - ./azure-shim.Caddyfile:/etc/caddy/Caddyfile:ro

Riffado's provider settings become:

Field | Value
Base URL | http://azure-shim/v1
Model | whisper (must equal the deployment name)
API key | the Azure resource key (forwarded as Bearer)

Verified

From inside the Riffado container: POST http://azure-shim/v1/audio/transcriptions → 200 {"text":"…"}. Transcription works end-to-end in the UI.

Part 2 · Summaries & titles
o3-mini and the empty answer

Symptom

The summary button showed "An unexpected error occurred." The container logs were more honest:

riffado-app logs

    Error generating title: TypeError: undefined is not an object (evaluating 'C.choices[0]')

Riffado calls chat/completions and reads choices[0] without checking whether the response was an error. So anything the API refuses becomes "an unexpected error." What was it refusing?

Cause 1: reasoning models reject the classic knobs

o3-mini belongs to Azure/OpenAI's o-series reasoning models, which hard-reject parameters every classic chat client sends. Riffado sends temperature: 0.7 and max_tokens: 50 for titles (0.5 / 2000 for summaries), and o3-mini answers:

    POST /openai/v1/chat/completions · model=o3-mini
    HTTP 400 {"error":{"message":"Unsupported parameter: 'max_tokens' is not supported with this model. Use 'max_completion_tokens' instead.", …}}
    # and with max_tokens fixed:
    HTTP 400 {"error":{"message":"Unsupported parameter: 'temperature' is not supported with this model.", …}}

Cause 2: reasoning tokens starve the output

Stripping the bad params gets you to 200, and then comes a subtler failure, my personal favourite of this whole saga. Reasoning models spend completion tokens on internal "thinking" before emitting a single visible character. Riffado's 50-token title budget is consumed entirely by reasoning, and the reply comes back syntactically valid and empty:

max_completion_tokens | reasoning_effort | finish_reason | content
50 | not set | length | "" (all 50 spent reasoning)
2000 | not set | stop | "Q3 Budget Planning Strategy Meeting"
2000 | low | stop | same, less reasoning overhead

The fix: a Node shim that rewrites the request body

Caddy can rewrite paths but not JSON bodies, so this shim is ~60 lines of dependency-free Node on node:20-alpine. Per request it:

- converts max_tokens → max_completion_tokens,
- strips temperature / top_p / penalties,
- floors the token budget at 4000,
- sets reasoning_effort: "low",
- maps /v1/* → /openai/v1/*, and forwards to the Azure resource.

o3-shim.js

    const http = require('http');
    const https = require('https');

    const UPSTREAM_HOST = '<resource>.services.ai.azure.com';

    // Params o-series reasoning models reject on chat/completions.
    const STRIP = ['temperature','top_p','presence_penalty',
                   'frequency_penalty','logprobs','top_logprobs'];

    const server = http.createServer((req, res) => {
      const chunks = [];
      req.on('data', c => chunks.push(c));
      req.on('end', () => {
        let body = Buffer.concat(chunks);

        // Riffado's base_url is http://o3-shim/v1 → map to Azure's /openai/v1
        let path = req.url;
        if (path.startsWith('/v1/')) path = '/openai' + path;

        const ct = (req.headers['content-type'] || '').toLowerCase();
        if (ct.includes('application/json') && body.length) {
          try {
            const j = JSON.parse(body.toString('utf8'));
            if (j && typeof j === 'object' && !Array.isArray(j)) {
              if ('max_tokens' in j) {
                if (!('max_completion_tokens' in j)) j.max_completion_tokens = j.max_tokens;
                delete j.max_tokens;
              }
              // Reasoning spends tokens before any visible output; small
              // budgets (Riffado sends 50 for titles) return empty strings.
              if (Array.isArray(j.messages)) {
                j.max_completion_tokens = Math.max(Number(j.max_completion_tokens) || 0, 4000);
                if (!('reasoning_effort' in j)) j.reasoning_effort = 'low';
              }
              for (const k of STRIP) delete j[k];
              body = Buffer.from(JSON.stringify(j));
            }
          } catch (_) { /* not JSON - forward untouched */ }
        }

        const headers = { ...req.headers, host: UPSTREAM_HOST,
                          'content-length': Buffer.byteLength(body) };
        const up = https.request(
          { host: UPSTREAM_HOST, port: 443, method: req.method, path, headers },
          upRes => { res.writeHead(upRes.statusCode, upRes.headers); upRes.pipe(res); }
        );
        up.on('error', e => {
          res.writeHead(502, {'content-type':'application/json'});
          res.end(JSON.stringify({error:{message:'o3-shim upstream error: '+e.message}}));
        });
        up.end(body);
      });
    });

    server.listen(80, () => console.log('o3-shim listening on :80'));

docker-compose.yml (added service)

    o3-shim:
      image: node:20-alpine
      restart: unless-stopped
      working_dir: /app
      command: ["node", "/app/o3-shim.js"]
      volumes:
        - ./o3-shim.js:/app/o3-shim.js:ro

Add a second provider in Riffado (base URL http://o3-shim/v1, model o3-mini, the resource's API key) and set it as the default enhancement provider (summaries/titles), keeping the Whisper one as default for transcription.

Riffado's exact title request (temperature: 0.7, max_tokens: 50) through the shim → 200, finish_reason: stop, real title text. A full meeting-transcript summary returns structured key points and action items.

The final shape

Reading it left to right: Riffado never talks to Azure directly. Transcription requests pass through azure-shim, a stock Caddy container that rewrites each request onto Whisper's classic deployment path and injects the mandatory api-version parameter. Summary and title requests pass through o3-shim, a tiny Node server that rewrites the request body into the shape o3-mini accepts and floors the token budget so the model's internal reasoning cannot starve the actual answer. As far as Riffado is concerned, it is simply talking to two ordinary OpenAI providers.

Both shims live on the Compose network only; nothing is exposed publicly. Riffado is unmodified.

Verification checklist

Each layer, testable in isolation. Run these before blaming the app:

smoke tests

    # 1. Key + resource alive? (v1 models listing, Bearer auth)
    curl -s -H "Authorization: Bearer $KEY" \
      https://<resource>.services.ai.azure.com/openai/v1/models | head -c 200

    # 2. Whisper answers on the classic path?
    curl -s -H "Authorization: Bearer $KEY" -F file=@test.wav \
      "https://<resource>.services.ai.azure.com/openai/deployments/whisper/audio/transcriptions?api-version=2024-06-01"

    # 3. Shim translates correctly? (from inside the compose network)
    docker exec riffado-app node -e "fetch('http://azure-shim/')
      .then(r=>r.text()).then(console.log)"

    # 4. o3-mini via shim, sending the params Riffado sends?
    #    (temperature + max_tokens:50; the shim must absorb both)

If you'd rather not run shims

Both shims exist because of the specific models chosen.
Pick models that live natively on the v1 route and Riffado connects directly, with base URL https://<resource>.services.ai.azure.com/openai/v1 and zero extra containers:

- Transcription: deploy gpt-4o-mini-transcribe (or gpt-4o-transcribe) instead of Whisper.
- Summaries: deploy a non-reasoning chat model such as gpt-4o-mini, which happily accepts temperature and max_tokens.

The shim approach earns its keep when you're standardized on specific models (Whisper's transcription quality, o3-mini's reasoning), or when you want a control point to add logging, retries, or budget caps later.

For reference, this is what the finished setup looks like on Riffado's side. Each shim is registered as a plain Custom provider. Here is the whisper provider pointing at azure-shim, with Use for transcription ticked:

And once both are saved, they sit side by side in the providers list, whisper tagged for transcription and o3-mini tagged for enhancement:

A quick look at the Foundry portal

In the Microsoft Foundry portal, head over to Models > AI Services and you will find a pleasant surprise: fifteen AI service models already deployed and ready to use, covering the Azure Speech family (including Voice Live and Speech to Text), Azure Translator, Azure Language, and Content Understanding:

You can of course deploy another model for this, but the pre-deployed ones are a handy cost-saving option. Click on the Azure Speech – Voice Live radio button and you will be shown the Base URL and API Key, which you can then paste into the provider settings on Riffado's Settings page.

A quick note on cost: these services are not free. They are billed pay-as-you-go based on usage. Azure Speech transcription is charged per audio hour, and Voice Live pricing is tiered by the model you choose. The free tier does include a monthly allowance, though. Check the Azure Speech pricing page before committing.

And if you would rather deploy a dedicated transcription model such as whisper, Foundry gives you the flexibility to do just that. Open the model page in the catalogue, click Deploy, and go with Default settings unless you need custom quotas or guardrails:

Let's test the setup

On your Plaud device, just tap to start recording. The little LED bars light up to show it is listening:

Or skip the device entirely and upload an audio file straight into Riffado using the Upload Audio button. Either way, the recording lands on the Recordings page; hit Transcribe and let the spinner do its thing:

As you can see below, whisper, the transcription model we deployed earlier, even managed to transcribe a recording in Malay without a hitch. My 3:32 test clip came back as 186 words of clean Malay, with the language correctly detected and tagged:

I have also set o3-mini as the enhancement provider, and it enhanced the transcription with a proper summary, key points, and title as well! The Meeting Notes-style summary came straight out of o3-mini through the shim, with zero manual prompting.

Wrapping up

What started as a TikTok-fuelled impulse buy nearly killed off by subscription pricing ended up as a fully self-hosted pipeline: Plaud for recording, Riffado as the interface, and Microsoft Foundry serving whisper and o3-mini behind two tiny shims. The total extra infrastructure came to two containers and roughly sixty lines of code, and not a single monthly subscription in sight.

If you try this setup and run into a failure mode I have not covered here, do share it in the comments. Half the fun is in the debugging.
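For anyone who wants to sanity-check the o3-shim's rewrite rules without starting a container, the same body transformation can be sketched in Python. This is a port for offline testing based on the Node code earlier in the post, not part of the deployed setup: `rewrite_body` and `TOKEN_FLOOR` are names chosen here for illustration, the content-type check and the proxying are left out, and only integer budgets are handled.

```python
import json

# Params o-series reasoning models reject on chat/completions (same list as the shim).
STRIP = ["temperature", "top_p", "presence_penalty",
         "frequency_penalty", "logprobs", "top_logprobs"]
TOKEN_FLOOR = 4000  # same floor the Node shim applies


def rewrite_body(raw: bytes) -> bytes:
    """Mirror the shim's JSON rewrite; anything that is not a JSON object passes through."""
    try:
        body = json.loads(raw.decode("utf-8"))
    except ValueError:  # covers both invalid UTF-8 and invalid JSON
        return raw
    if not isinstance(body, dict):
        return raw

    if "max_tokens" in body:
        # Keep an explicit max_completion_tokens if the caller already sent one.
        body.setdefault("max_completion_tokens", body["max_tokens"])
        del body["max_tokens"]

    if isinstance(body.get("messages"), list):
        current = body.get("max_completion_tokens")
        is_int = isinstance(current, int) and not isinstance(current, bool)
        body["max_completion_tokens"] = max(current if is_int else 0, TOKEN_FLOOR)
        body.setdefault("reasoning_effort", "low")

    for key in STRIP:
        body.pop(key, None)
    return json.dumps(body).encode("utf-8")


# The title request described in the post: temperature 0.7, max_tokens 50.
title_request = json.dumps({
    "model": "o3-mini",
    "temperature": 0.7,
    "max_tokens": 50,
    "messages": [{"role": "user", "content": "Suggest a title"}],
}).encode("utf-8")
print(json.loads(rewrite_body(title_request)))
```

Feeding it the payloads an app actually sends is a quick way to confirm which fields survive before pointing a real client at the shim.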
suzarilshah
Jul 22, 2026 Place Educator Developer Blog
Pantone’s Palette Generator enhances creative exploration with agentic AI on Azure
Fixing tags
mtoiba
Jul 17, 2026 Place Customer Innovation Blog
o3-deep-research fails with status "incomplete" and reason "content_filter"
I am working on deep research over internal data, currently using the Azure OpenAI Responses API with an MCP tool. The underlying MCP server is deployed to ACA and exposes search and fetch tools whose signatures comply with the specification (https://developers.openai.com/apps-sdk/build/mcp-server#company-knowledge-compatibility). The OpenAI client is created with the o3-deep-research model and the MCP tool, and the response status is checked in a loop (https://learn.microsoft.com/en-us/azure/foundry/openai/how-to/deep-research#remote-mcp-server-with-deep-research).

Deep research runs for some time: I can see in the logs that the handshake is made, ListTools is invoked, the search tool is called, and after that fetch is called for the queries framed by the model. Intermittently, however, the response status becomes "incomplete" with the incomplete reason "content_filter". Otherwise deep research works fine.

I am not able to identify the root cause, because there seems to be no way to tell what triggered the content filter, whether it was the prompt or the completion. How can I debug this, find the root cause, and fix it? Is there a known issue with the o3-deep-research model's intermediate reasoning completions, or are the search and fetch tool results causing this?

I uploaded a file and made it available to the MCP server. The search tool uses an Azure OpenAI agent to search the data using File Search, and the fetch tool returns the content of the file based on the id passed. For the same file and the same research topic, the issue does not occur every time, only intermittently.
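A minimal sketch of the status check described above, assuming the polled response has been converted to a plain dict (for example with the SDK object's `model_dump()`). `classify_response` is a hypothetical helper, not code from this thread; the only fields it relies on are `status` and `incomplete_details.reason` from the Responses API response object.

```python
from typing import Any, Dict, Optional

# Statuses after which polling can stop.
TERMINAL = {"completed", "failed", "cancelled", "incomplete"}


def classify_response(resp: Dict[str, Any]) -> Optional[str]:
    """Return None while the run is still going, else a short label for logging."""
    status = resp.get("status")
    if status not in TERMINAL:
        return None  # queued / in_progress: keep polling
    if status == "incomplete":
        details = resp.get("incomplete_details") or {}
        return "incomplete:" + str(details.get("reason", "unknown"))
    return str(status)


print(classify_response({"status": "in_progress"}))
print(classify_response({"status": "incomplete",
                         "incomplete_details": {"reason": "content_filter"}}))
```

Logging this label together with the response id on every run at least makes it possible to line up each "incomplete:content_filter" occurrence with the search and fetch calls recorded in the MCP server logs.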
Murugates
Jul 16, 2026 Place Microsoft Foundry Discussions
GPT-5.5-Pro not listed in foundry?
The model is mentioned in this blog post: https://azure.microsoft.com/en-us/blog/openais-gpt-5-5-in-microsoft-foundry-frontier-intelligence-on-an-enterprise-ready-platform/ But it is currently not listed in Foundry; the latest pro model available is 5.4-pro. When will the 5.5-pro model be available in Azure Foundry?
hr718
Jul 16, 2026 Place Microsoft Foundry Discussions
Migrating to GPT-5.x Without Breaking GPT-4: A Practical, Backward-Compatible Playbook
The first request your service sends after swapping gpt-4o for gpt-5.1 in production will return HTTP 400. Not in two weeks. On the first call. And the parameter the error points to isn't one you set anywhere in your code - it's bound onto the request by a LangChain helper you've used for two years.

This post walks through every breaking change between the GPT-4 and GPT-5 families on Azure OpenAI in Microsoft Foundry, the integration cliffs nobody warns you about, and the small set of files you need so the same call sites work against both model families without branching.

Who this is for: engineers maintaining an existing production codebase that calls Azure OpenAI / OpenAI - directly or through LangChain - and needs to onboard GPT-5.x while keeping the GPT-4 deployments alive during rollout.

What you'll leave with: one copy-paste compatibility module, a tiny LangChain subclass, a prompt-audit harness, and a 10-step rollout checklist.

1. Why this migration is different

Every previous Azure OpenAI bump - 3.5 → 4, 4 → 4o, 4o → 4o-mini - was additive. You changed engine="gpt-4o" and everything kept working. GPT-5.x is the first generation that is subtractive: parameters you used to send now return 400 Unsupported parameter. The wire protocol itself changed because GPT-5 is a reasoning model - it spends tokens thinking internally before it answers, so the parameters that controlled the old sampling pipeline (temperature, top_p, presence_penalty, frequency_penalty) no longer exist on the request schema.

What this means for production code:

- A passing test suite against gpt-4o will fail on the first call against gpt-5.1 with HTTP 400.
- A passing test suite against gpt-5.1 will fail on every legacy gpt-4* deployment because the new reasoning controls (reasoning_effort, verbosity) are not recognised there.
- LangChain helpers that worked unmodified for two years (notably create_sql_query_chain) silently bind stop=[...] onto your LLM and trigger the same 400. Source-grep won't find the offending line because it lives inside the library.

The good news: the divergence is mechanical. With one detection helper, one parameter-builder, and one tiny LangChain subclass you can run the same code against both families.

2. The breaking-changes matrix

Concern | GPT-4 / GPT-4o (legacy) | GPT-5.x / o1 / o3 (reasoning)
Output budget | max_tokens | max_completion_tokens (rejects max_tokens)
temperature | 0.0–2.0 | Only the default (1) is accepted - omit it
top_p | Supported | Rejected
presence_penalty, frequency_penalty | Supported | Rejected
logprobs, logit_bias | Supported | Rejected
stop sequences | Supported | Rejected on most reasoning deployments
reasoning_effort | Rejected | New: minimal, low, medium, or high
verbosity | Rejected | New: low, medium, or high (sometimes via extra_body)
System instruction role | system | developer recommended; system still works as alias
Output token cost | Output tokens only | Output + reasoning tokens count against your cap
Recommended API version | 2024-12-01-preview or earlier | 2025-03-01-preview or later

Two consequences are easy to miss:

- max_completion_tokens is a shared budget. GPT-5.1 can burn 2–4× more tokens internally before emitting the first response token. A cap of 4096 that comfortably held a SQL query on GPT-4o now silently truncates the answer mid-token on GPT-5.1. Multiply your legacy budgets by ~2.5× and add a floor (e.g. 4096) before sending.
- The stop parameter is the silent killer. Any helper that calls llm.bind(stop=[...]) - and there are several in langchain - will turn a working code path into a 400 the moment you swap deployments.

3. Compatibility strategy: detect, don't fork

The temptation is to fork: one branch for GPT-4, one for GPT-5. Don't. The right unit of abstraction is one function that classifies the deployment into a family, and one function that builds a kwargs dict the SDK will accept for that family. Every call site - SDK, LangChain, raw HTTP - drains into the same kwargs builder.
When you eventually retire GPT-4 you delete the legacy branch in one file, not in fifty. 4. The industry-agnostic compatibility module Drop the following file into your project. It has no Azure / OpenAI / LangChain imports at module load time, so the same file works from a web service, a serverless function, a notebook, or a CLI tool. 4.1 model_compat.py """ Model compatibility helper for GPT-5.x with GPT-4 backward compatibility. This module centralises the parameter translation needed to talk to the "reasoning" generation of OpenAI / Azure OpenAI models (GPT-5, GPT-5.1, o1, o3, o4) while keeping older deployments (gpt-4, gpt-4o, gpt-4-32k, gpt-3.5-turbo, etc.) working unchanged. """ from __future__ import annotations import logging import os import re from typing import Any, Dict, Iterable, Mapping, Optional # --------------------------------------------------------------------------- # Family detection # --------------------------------------------------------------------------- _REASONING_PATTERNS = ( # gpt-5, gpt5, gpt-5.1, gpt_5, GPT 5, gpt5mini-prod-eu, ... re.compile(r"(?i)(^|[^a-z0-9])gpt[-_ ]?5(\.\d+)?([^0-9]|$)"), # o1, o3, o4, o1-mini, o3-preview ... re.compile(r"(?i)(^|[^a-z0-9])o[134](-mini|-preview)?([^a-z0-9]|$)"), ) _LEGACY_PATTERNS = ( re.compile(r"(?i)gpt[-_ ]?4o"), re.compile(r"(?i)gpt[-_ ]?4(?!\d)"), re.compile(r"(?i)gpt[-_ ]?4[-_ ]?32k"), re.compile(r"(?i)gpt[-_ ]?3\.?5"), re.compile(r"(?i)gpt[-_ ]?35"), ) def get_model_family(model_or_deployment: Optional[str]) -> str: """Return ``"reasoning"`` for GPT-5.x / o-series, ``"legacy"`` otherwise. Honours an ``OPENAI_MODEL_FAMILY`` env-var override for deployments whose user-defined name does not embed the model family (e.g. ``prod-default``). 
""" override = (os.getenv("OPENAI_MODEL_FAMILY") or "").strip().lower() if override in {"reasoning", "gpt-5", "gpt5", "gpt-5.1", "o-series", "o1", "o3"}: return "reasoning" if override in {"legacy", "gpt-4", "gpt4", "gpt-3.5", "gpt35", "chat"}: return "legacy" name = (model_or_deployment or "").strip() if not name: # Fail closed: when we don't know, assume legacy so old code keeps # working. Misclassifying a reasoning deployment as legacy fails fast # with a clear "Unsupported parameter" 400; the reverse silently # drops parameters the caller expected. return "legacy" for pat in _REASONING_PATTERNS: if pat.search(name): return "reasoning" for pat in _LEGACY_PATTERNS: if pat.search(name): return "legacy" return "legacy" def is_reasoning_model(model_or_deployment: Optional[str]) -> bool: return get_model_family(model_or_deployment) == "reasoning" # --------------------------------------------------------------------------- # Reasoning controls # --------------------------------------------------------------------------- _VALID_REASONING_EFFORT = {"minimal", "low", "medium", "high"} _VALID_VERBOSITY = {"low", "medium", "high"} def _coerce_choice(raw: Optional[str], valid: Iterable[str]) -> Optional[str]: if raw is None: return None value = str(raw).strip().lower() if not value: return None if value not in set(valid): logging.warning( "Ignoring unsupported value '%s'; expected one of %s", raw, sorted(valid), ) return None return value def get_reasoning_effort(override: Optional[str] = None) -> Optional[str]: return _coerce_choice( override if override is not None else os.getenv("OPENAI_REASONING_EFFORT"), _VALID_REASONING_EFFORT, ) def get_verbosity(override: Optional[str] = None) -> Optional[str]: return _coerce_choice( override if override is not None else os.getenv("OPENAI_VERBOSITY"), _VALID_VERBOSITY, ) # --------------------------------------------------------------------------- # max_completion_tokens scaling # 
--------------------------------------------------------------------------- def _reasoning_token_scale() -> float: """Multiplier applied to legacy ``max_tokens`` when targeting a reasoning model.""" try: scale = float(os.getenv("OPENAI_REASONING_TOKEN_SCALE", "2.5")) except (TypeError, ValueError): scale = 2.5 return scale if scale > 0 else 1.0 def _reasoning_token_floor() -> int: try: floor = int(os.getenv("OPENAI_REASONING_TOKEN_FLOOR", "4096")) except (TypeError, ValueError): floor = 4096 return floor if floor > 0 else 4096 def scale_max_tokens_for_reasoning(max_tokens: Optional[int]) -> Optional[int]: """Scale a legacy ``max_tokens`` budget up for reasoning models. ``None`` and ``-1`` ("no explicit cap") are passed through. """ if max_tokens is None: return None if max_tokens == -1: return -1 return max(int(round(max_tokens * _reasoning_token_scale())), _reasoning_token_floor()) # --------------------------------------------------------------------------- # Kwargs builders # --------------------------------------------------------------------------- _SAMPLING_KEYS = ("temperature", "top_p", "presence_penalty", "frequency_penalty") def _drop_none(mapping: Mapping[str, Any]) -> Dict[str, Any]: return {k: v for k, v in mapping.items() if v is not None} def build_openai_chat_kwargs( model: str, *, max_tokens: Optional[int] = None, temperature: Optional[float] = None, top_p: Optional[float] = None, presence_penalty: Optional[float] = None, frequency_penalty: Optional[float] = None, reasoning_effort: Optional[str] = None, verbosity: Optional[str] = None, extra: Optional[Mapping[str, Any]] = None, ) -> Dict[str, Any]: """Build kwargs for ``openai.OpenAI / AzureOpenAI .chat.completions.create``. Splat the result directly: ``client.chat.completions.create(**kwargs)``. Unsupported parameters are silently omitted for reasoning models; legacy deployments retain the historical behaviour. 
""" family = get_model_family(model) kwargs: Dict[str, Any] = {"model": model} # ---- output budget ---- if max_tokens is not None and max_tokens != -1: if family == "reasoning": kwargs["max_completion_tokens"] = scale_max_tokens_for_reasoning(int(max_tokens)) else: kwargs["max_tokens"] = int(max_tokens) # ---- sampling ---- if family == "legacy": kwargs.update(_drop_none({ "temperature": temperature, "top_p": top_p, "presence_penalty": presence_penalty, "frequency_penalty": frequency_penalty, })) else: for key, value in ( ("temperature", temperature), ("top_p", top_p), ("presence_penalty", presence_penalty), ("frequency_penalty", frequency_penalty), ): if value is not None: logging.debug( "Dropping unsupported parameter '%s' for reasoning model '%s'", key, model, ) # ---- reasoning controls ---- if family == "reasoning": effort = get_reasoning_effort(reasoning_effort) if effort is not None: kwargs["reasoning_effort"] = effort verb = get_verbosity(verbosity) if verb is not None: # ``verbosity`` is not a top-level kwarg in openai-python <= 1.65.x; # route it via ``extra_body`` so it lands in the JSON without a # TypeError from the SDK. kwargs.setdefault("extra_body", {})["verbosity"] = verb # ---- caller-supplied extras (already filtered) ---- if extra: for key, value in extra.items(): if value is None: continue if family == "reasoning" and key in _SAMPLING_KEYS: continue kwargs[key] = value return kwargs def build_langchain_chat_kwargs( deployment_name: str, *, max_tokens: Optional[int] = None, temperature: Optional[float] = None, top_p: Optional[float] = None, reasoning_effort: Optional[str] = None, verbosity: Optional[str] = None, ) -> Dict[str, Any]: """Build kwargs for ``langchain_openai.AzureChatOpenAI`` / ``ChatOpenAI``. Older ``langchain-openai`` releases don't expose ``max_completion_tokens`` as a top-level kwarg, so we forward it through ``model_kwargs`` (which langchain passes straight to the SDK). 
""" family = get_model_family(deployment_name) kwargs: Dict[str, Any] = {} model_kwargs: Dict[str, Any] = {} if max_tokens is not None and max_tokens != -1: if family == "reasoning": model_kwargs["max_completion_tokens"] = scale_max_tokens_for_reasoning(int(max_tokens)) else: kwargs["max_tokens"] = int(max_tokens) if family == "reasoning": effort = get_reasoning_effort(reasoning_effort) if effort is not None: model_kwargs["reasoning_effort"] = effort verb = get_verbosity(verbosity) if verb is not None: model_kwargs.setdefault("extra_body", {})["verbosity"] = verb else: if temperature is not None: kwargs["temperature"] = temperature if top_p is not None: kwargs["top_p"] = top_p if model_kwargs: kwargs["model_kwargs"] = model_kwargs return kwargs def get_system_role(model_or_deployment: Optional[str] = None) -> str: """Return ``"developer"`` for reasoning models when opted in, ``"system"`` otherwise. Defaulting to ``"system"`` preserves compatibility with LangChain prompt templates and SDK helpers that don't yet recognise the new role. Opt in with ``OPENAI_USE_DEVELOPER_ROLE=1`` once your stack supports it. 
""" if not is_reasoning_model(model_or_deployment): return "system" raw = os.getenv("OPENAI_USE_DEVELOPER_ROLE", "") return "developer" if raw.strip().lower() in {"1", "true", "yes", "on"} else "system" 4.2 What this buys you Every direct-SDK call collapses to two lines: from openai import AzureOpenAI from model_compat import build_openai_chat_kwargs client = AzureOpenAI( azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"], api_version=os.environ["OPENAI_API_VERSION"], api_key=os.environ["AZURE_OPENAI_API_KEY"], ) kwargs = build_openai_chat_kwargs( model=os.environ["OPENAI_ENGINE"], max_tokens=4096, # automatically becomes max_completion_tokens for GPT-5 temperature=0.2, # automatically dropped for GPT-5 reasoning_effort="low", # automatically dropped for GPT-4 ) response = client.chat.completions.create( messages=[ {"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": user_input}, ], **kwargs, ) The same call site now correctly targets gpt-5.1, gpt-4o, gpt-4-32k, o3-mini, or any future deployment whose name embeds the family - and you can override with the OPENAI_MODEL_FAMILY env var when the deployment alias is opaque. 4.3 Raw HTTP call sites Some legacy code paths bypass the SDK and POST JSON directly. The same builder works there: import json import requests from model_compat import build_openai_chat_kwargs, get_system_role deployment = os.environ["OPENAI_ENGINE"] api_version = os.environ["OPENAI_API_VERSION"] endpoint = ( f"{os.environ['AZURE_OPENAI_ENDPOINT']}/openai/deployments/{deployment}" f"/chat/completions?api-version={api_version}" ) payload = { "messages": [ {"role": get_system_role(deployment), "content": system_prompt}, {"role": "user", "content": user_prompt}, ], } # Splat the kwargs into the payload, then strip the SDK-only ``model`` key. 
    payload.update(build_openai_chat_kwargs(
        model=deployment,
        max_tokens=800,
        temperature=0.7,
        top_p=0.95,
        reasoning_effort="low",
    ))
    payload.pop("model", None)       # ``model`` is encoded in the URL for Azure
    payload.pop("extra_body", None)  # already on the payload root

    resp = requests.post(
        endpoint,
        headers={
            "Content-Type": "application/json",
            "api-key": os.environ["AZURE_OPENAI_API_KEY"],
        },
        data=json.dumps(payload),
        timeout=60,
    )
    resp.raise_for_status()

5. LangChain: the hidden stop parameter

langchain.chains.sql_database.query.create_sql_query_chain calls llm.bind(stop=["\nSQLResult:"]) internally to terminate the model's output before the example block in its prompt. That stop value is forwarded to the SDK on every invocation. GPT-5.1 rejects it:

    openai.BadRequestError: Error code: 400 - {'error': {
        'message': "Unsupported parameter: 'stop' is not supported with this model.",
        'type': 'invalid_request_error',
        'param': 'stop',
    }}

You can't reach into the chain to disable it. The clean fix is a thin AzureChatOpenAI subclass that drops stop for reasoning models only:

5.1 langchain_compat.py

    """LangChain-side compatibility shim for reasoning-class deployments."""
    from __future__ import annotations

    from typing import Any, List, Optional

    from langchain_core.callbacks.manager import (
        AsyncCallbackManagerForLLMRun,
        CallbackManagerForLLMRun,
    )
    from langchain_core.messages import BaseMessage
    from langchain_core.outputs import ChatResult
    from langchain_openai import AzureChatOpenAI  # use ChatOpenAI for non-Azure

    from model_compat import is_reasoning_model


    class ReasoningSafeAzureChatOpenAI(AzureChatOpenAI):
        """``AzureChatOpenAI`` variant that hides parameters reasoning models reject.

        Reasoning models (GPT-5.x, o1/o3/o4) return HTTP 400 when a request
        payload carries ``stop``. LangChain's SQL helpers unconditionally bind
        it, so the unsupported parameter reaches the SDK regardless of how the
        caller configured the LLM.

        This subclass strips ``stop`` for reasoning deployments while forwarding
        it unchanged for legacy GPT-4 / GPT-3.5 deployments - the behaviour is
        byte-identical to upstream LangChain for those models.
        """

        def _deployment_id(self) -> str:
            # ``langchain-openai`` >= 0.2 exposes ``azure_deployment``; older
            # releases use ``deployment_name``. Either may be set by the caller.
            return (
                getattr(self, "azure_deployment", None)
                or getattr(self, "deployment_name", None)
                or ""
            )

        def _generate(
            self,
            messages: List[BaseMessage],
            stop: Optional[List[str]] = None,
            run_manager: Optional[CallbackManagerForLLMRun] = None,
            **kwargs: Any,
        ) -> ChatResult:
            if is_reasoning_model(self._deployment_id()):
                stop = None
            return super()._generate(messages, stop=stop, run_manager=run_manager, **kwargs)

        async def _agenerate(
            self,
            messages: List[BaseMessage],
            stop: Optional[List[str]] = None,
            run_manager: Optional[AsyncCallbackManagerForLLMRun] = None,
            **kwargs: Any,
        ) -> ChatResult:
            if is_reasoning_model(self._deployment_id()):
                stop = None
            return await super()._agenerate(messages, stop=stop, run_manager=run_manager, **kwargs)

Use it as a drop-in replacement:

    import os

    from langchain_compat import ReasoningSafeAzureChatOpenAI
    from model_compat import build_langchain_chat_kwargs

    llm_kwargs = build_langchain_chat_kwargs(
        deployment_name=os.environ["OPENAI_ENGINE"],
        max_tokens=6000,
        temperature=0,
        reasoning_effort="low",
    )
    llm = ReasoningSafeAzureChatOpenAI(
        azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
        azure_deployment=os.environ["OPENAI_ENGINE"],
        openai_api_version=os.environ["OPENAI_API_VERSION"],
        api_key=os.environ["AZURE_OPENAI_API_KEY"],
        **llm_kwargs,
    )

That single substitution makes create_sql_query_chain, SQLDatabaseChain, and the ChatOpenAI-based RAG helpers all work against GPT-5.1 without any other changes.

6. The second LangChain gotcha: prose where SQL should be

create_sql_query_chain is documented to return the literal string "I don't know" (or a similar fallback) when the LLM cannot form a query. The default code path takes the chain output and runs it against the database:

    sql = chain.invoke({...})   # -> "I don't know"
    result = db.run(sql)        # -> sends "I don't know" to pyodbc

The database faithfully returns:

    [42000] Unclosed quotation mark after the character string 't know'. (105)

Which surfaces to the end user as a misleading "SQL syntax error". The mitigation is a small guard that checks the chain output looks like SQL before it is executed:

    import logging
    import re

    _SQL_START_RE = re.compile(
        r"^\s*(?:WITH|SELECT|INSERT|UPDATE|DELETE|CREATE|DROP|ALTER|MERGE|EXEC|EXECUTE|TRUNCATE)\b",
        re.IGNORECASE,
    )

    def looks_like_sql(text: str) -> bool:
        """True only if ``text`` starts with a recognised SQL DML/DDL keyword."""
        if not text or not text.strip():
            return False
        return bool(_SQL_START_RE.match(text))

    # Inside the handler that runs the chain (extract_sql_query is defined in Section 7):
    sql = extract_sql_query(chain.invoke({...}))
    if not looks_like_sql(sql):
        logging.warning("SQL chain returned a non-SQL response: %r", sql[:200])
        return (
            "I couldn't form a SQL query for that question. "
            "Please rephrase or add more context."
        )
    result = db.run(sql)

This isn't specific to GPT-5.1 - it's good hygiene for any LLM that backs a SQL agent - but the failure mode becomes much more frequent on reasoning models because they're better at refusing.

7. Cleaning Markdown out of create_sql_query_chain output

Reasoning models like to wrap their answer in a markdown fence and append a "Note:" or "Explanation:" paragraph. None of that survives db.run(). A defensive extract_sql_query handles all the variants:

    import re

    def extract_sql_query(text: str) -> str:
        """Strip markdown fences, leading prose, and trailing explanations."""
        # 1) Prefer SQL inside a markdown code fence.
        m = re.search(r"```(?:sql|SQL|Sql)?\s*\n(.*?)\n```", text, re.DOTALL)
        if m:
            text = m.group(1)
        text = text.strip()

        # 2) Drop any prose *before* the SQL by jumping to the first SQL keyword.
        m = re.search(
            r"(?im)^\s*(WITH|SELECT|INSERT|UPDATE|DELETE|CREATE|DROP|ALTER|MERGE|EXEC|EXECUTE|TRUNCATE)\b",
            text,
        )
        if m:
            text = text[m.start(1):]

        # 3) Cut at the first "Explanation:" / "Note:" / "This query..." marker.
        m = re.compile(
            r"(?im)^\s*(?:Explanation|Note|Notes|Here(?:'|\u2019)?s|"
            r"This\s+(?:query|SQL|statement|returns|counts|selects|will|gets|finds)|"
            r"The\s+(?:query|SQL|above|result|statement)|"
            r"Result|Results|Description|Output|Answer)\b[^\n]*"
        ).search(text)
        if m:
            text = text[: m.start()].rstrip()

        # 4) Drop any trailing fence that survived step 1.
        if text.endswith("```"):
            text = text[:-3].rstrip()
        return text.strip()

8. Package versioning

The bare minimum your requirements.txt / environment.yml needs:

| Package | Last GPT-4-only version | First GPT-5.x-safe version | Notes |
|---|---|---|---|
| openai | 1.55.x | 1.65.x (recommend 1.65.4+) | Earlier versions reject max_completion_tokens and reasoning_effort as unknown kwargs |
| langchain-openai | 0.2.14 | 0.3.7+ | 0.3.x line exposes azure_deployment and forwards model_kwargs correctly to the new SDK |
| langchain | 0.3.14 | 0.3.21+ | Pin together with langchain-openai and langchain-core |
| langchain-core | 0.3.29 | 0.3.49+ | Update in lockstep with the others |
| langchain-community | 0.3.14 | 0.3.20+ | Mostly transitive; needed for SQLDatabase helpers |
| tiktoken | 0.7.x | 0.8.0+ | Encodings for GPT-5.1 ship in 0.8.0; older versions fall back to cl100k_base for unknown models |
| tokencost (optional) | 0.1.16 | 0.1.20+ | Update for GPT-5.x price tables |
| Azure OpenAI API version | 2024-12-01-preview | 2025-03-01-preview | First version that ships reasoning_effort and the GPT-5.x routing |

Pin exact versions after testing - LangChain has a habit of moving public re-exports between minor releases.
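The floors in the table can also be checked at startup rather than by eye. A minimal sketch, assuming the minimums are copied by hand from the "First GPT-5.x-safe version" column; parse_version, outdated_packages, and installed_versions are hypothetical helper names, not part of model_compat.py, and the parser deliberately ignores pre-release suffixes:

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, List, Optional, Tuple

# Minimums copied from the "First GPT-5.x-safe version" column (assumption:
# keep this dict in sync with the table by hand).
MIN_VERSIONS: Dict[str, str] = {
    "openai": "1.65.4",
    "langchain": "0.3.21",
    "langchain-core": "0.3.49",
    "langchain-openai": "0.3.7",
    "langchain-community": "0.3.20",
    "tiktoken": "0.8.0",
}


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '0.3.21' into (0, 3, 21); suffixes such as 'rc1' are ignored."""
    parts: List[int] = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    while len(parts) < 3:  # pad so '0.8' compares equal to '0.8.0'
        parts.append(0)
    return tuple(parts)


def outdated_packages(installed: Dict[str, Optional[str]]) -> List[str]:
    """Return one message per package that is missing or below its minimum."""
    problems: List[str] = []
    for name, minimum in MIN_VERSIONS.items():
        found = installed.get(name)
        if found is None:
            problems.append(f"{name}: not installed (need >= {minimum})")
        elif parse_version(found) < parse_version(minimum):
            problems.append(f"{name}: {found} < {minimum}")
    return problems


def installed_versions() -> Dict[str, Optional[str]]:
    """Look up what is actually installed in the current environment."""
    result: Dict[str, Optional[str]] = {}
    for name in MIN_VERSIONS:
        try:
            result[name] = version(name)
        except PackageNotFoundError:
            result[name] = None
    return result
```

Calling outdated_packages(installed_versions()) at service start and logging the result is usually enough; an empty list means every pinned package meets its floor.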
requirements.txt snippet:

    openai==1.65.4
    langchain==0.3.21
    langchain-core==0.3.49
    langchain-openai==0.3.7
    langchain-community==0.3.20
    tiktoken==0.8.0

9. New GPT-5.x knobs worth using

Once you're on a reasoning deployment, two new parameters become available. Both are optional, both default to a sensible value, and both are stripped by the kwargs builder above when the target is a legacy model.

reasoning_effort

- minimal - one-shot lookups, classification.
- low - deterministic structured output (SQL, JSON-schema extraction, rule-based rewrites). Lowest cost overhead.
- medium (default) - RAG, summarisation, normal Q&A.
- high - multi-step analytical reasoning, complex code synthesis.

A useful pattern is to choose the level by task profile rather than at the call site:

    TASK_EFFORT = {
        "sql": "low",
        "structured_extract": "low",
        "kg_cleaning": "low",
        "rag_qa": "medium",
        "vision": "medium",
        "analytical": "high",
    }

verbosity

low | medium | high. Controls the length of the response, not its substance. Useful for grounding chat UIs where you want crisp answers - set low for /answer endpoints and high for "explain like a senior engineer" panels.

Note: in openai-python <= 1.65.x, verbosity is not yet a top-level keyword argument; pass it through extra_body (the builder above already does this).

developer role

GPT-5.x prefers {"role": "developer", "content": "..."} for instructions that previously used system. The change is non-breaking on the Azure side - system is still accepted as an alias - but some downstream LangChain prompt templates predate the role and will reject it on construction. Treat developer as opt-in (OPENAI_USE_DEVELOPER_ROLE=1) for now; flip the default after your prompt-template version is known good.

10. Auditing your existing prompts

When the wire-level migration is done your service will talk to GPT-5.x - but that doesn't mean it says the right thing. Reasoning models read prompts differently in ways that won't show up as 400s:

- They take instructions more literally. A prompt that worked when GPT-4o rounded the corners may surface every edge case verbatim.
- They refuse more often. "I don't know" / "I cannot help with that" are more frequent because reasoning models are less willing to confabulate.
- They ignore "be concise" / "be terse". Use the new verbosity knob.
- Step-by-step / chain-of-thought instructions become redundant. The model already reasons internally; extra "think before you answer" prose competes with its own chain of thought and often hurts output quality.
- Negative-only instructions can backfire. "Never output X" prompts occasionally cause refusals where you'd rather have a workaround.

10.1 Build a prompt regression harness

Capture every system+user prompt your service emits in a CSV, then replay each one against both deployments and diff the output. The diff is the single most useful artefact you can produce before the cutover:

    # prompt_audit.py - minimal differential tester
    import csv
    import os

    from openai import AzureOpenAI
    from model_compat import build_openai_chat_kwargs

    LEGACY = "gpt-4o"
    REASONING = "gpt-5.1"

    client = AzureOpenAI(
        azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
        api_version=os.environ["OPENAI_API_VERSION"],
        api_key=os.environ["AZURE_OPENAI_API_KEY"],
    )

    def run(model: str, system: str, user: str) -> str:
        kw = build_openai_chat_kwargs(
            model=model,
            max_tokens=4096,
            temperature=0.2,            # auto-dropped for reasoning
            reasoning_effort="medium",  # auto-dropped for legacy
        )
        resp = client.chat.completions.create(
            messages=[
                {"role": "system", "content": system},
                {"role": "user", "content": user},
            ],
            **kw,
        )
        return resp.choices[0].message.content or ""

    with open("prompts.csv", newline="") as f_in, open("diff.tsv", "w", newline="") as f_out:
        writer = csv.writer(f_out, delimiter="\t")
        writer.writerow(["id", "legacy_first80", "reasoning_first80",
                         "len_legacy", "len_new", "identical"])
        for row in csv.DictReader(f_in):
            legacy = run(LEGACY, row["system"], row["user"])
            new = run(REASONING, row["system"], row["user"])
            writer.writerow([
                row["id"],
                legacy[:80].replace("\n", " "),
                new[:80].replace("\n", " "),
                len(legacy),
                len(new),
                legacy.strip() == new.strip(),
            ])

Capture three signals per prompt - they're enough to triage 95% of drift:

- Format compliance. Did the output still parse as the expected JSON / YAML / Markdown / SQL? Run your existing downstream parser on both columns.
- Token cost delta. Reasoning models tend to be more verbose by default. Anything beyond +20% is a candidate for the verbosity="low" knob.
- Semantic drift. Spot-check 5–10% of rows by hand. You're looking for changes in intent, not changes in wording.

10.2 Common rewrites to make prompts model-agnostic

The goal isn't to write two prompts. It's to write one prompt that produces correct output on both families by moving constraints out of the natural-language body and into the request shape.

10.2a. Format constraints belong in response_format, not the prose

Don't:

    Output ONLY a JSON object with keys `name` and `score`.
    Do not include any explanation. Do not wrap in markdown.
    Do not say anything else.

Do:

    resp = client.chat.completions.create(
        messages=[...],
        response_format={
            "type": "json_schema",
            "json_schema": {
                "name": "scored_entity",
                "schema": {
                    "type": "object",
                    "properties": {
                        "name": {"type": "string"},
                        "score": {"type": "number"},
                    },
                    "required": ["name", "score"],
                    "additionalProperties": False,
                },
                "strict": True,
            },
        },
        **kw,
    )

response_format is honoured by both gpt-4o (>= 2024-08-06) and the entire GPT-5.x line. The prompt loses three lines of brittle natural-language constraints and you get schema-validated output for free.

10.2b. Replace "think step by step" with reasoning_effort

Don't:

    Let's think step by step. First identify the entity. Then find the
    category. Then compute the score. Then format the answer.
Do: delete the prose and pass reasoning_effort="medium" (or "high") for reasoning deployments. The kwargs builder drops the parameter automatically for GPT-4 models, so the same prompt now produces:

- step-by-step reasoning internally on GPT-5.x (lower output token cost),
- the same final answer on GPT-4o that the verbose prompt used to elicit.

10.2c. Replace temperature-based variety with n sampling

If your code relied on temperature=0.9 to get diverse completions, GPT-5.x will return roughly the same answer every time. Generate variety the explicit way:

    resp = client.chat.completions.create(messages=[...], n=5, **kw)
    candidates = [c.message.content for c in resp.choices]

Or call the model N times with slightly different framings. Both patterns work against either family with no further code changes.

10.2d. Move procedural instructions to the developer role

For multi-step workflows, the new developer role gives clearer separation between what the system enforces and what the user is asking:

    messages = [
        {"role": get_system_role(deployment), "content": role_card_for_assistant},
        {"role": "developer", "content": procedural_instructions},
        {"role": "user", "content": user_question},
    ]

get_system_role returns "system" for legacy models and "developer" for reasoning models opted in via OPENAI_USE_DEVELOPER_ROLE=1. Once your LangChain templates support the new role you can flip the default.

10.2e. Add a literal-execution header for strict formats

For prompts where the exact output shape matters (table generation, SQL with a fixed column order, structured incident reports), prepend an explicit literal-execution header so reasoning models don't drift into "helpful improvements":

    LITERAL_EXECUTION_HEADER = (
        "Execution mode: follow the instructions below literally and in order. "
        "Do not infer intent, skip, reorder, merge, or add steps. Honour the "
        "exact formatting, tone, and verbosity specified. If a step is "
        "ambiguous, respond with the literal interpretation and flag the "
        "ambiguity instead of guessing."
    )

    def apply_literal_execution(prompt: str) -> str:
        if LITERAL_EXECUTION_HEADER in prompt:
            return prompt
        return f"{LITERAL_EXECUTION_HEADER}\n\n{prompt}"

It's a no-op on GPT-4o (the older models already follow instructions literally enough) and a meaningful guard rail on GPT-5.1. Wire it behind an OPENAI_LITERAL_EXECUTION flag so you can disable it without redeploying.

10.3 A prompt-shaped checklist

Run every prompt your service emits past these questions:

| Question | Action |
|---|---|
| Does it specify output format in prose? | Move to response_format (10.2a) |
| Does it include "think step by step"? | Remove; set reasoning_effort (10.2b) |
| Does it set tone constraints ("be concise")? | Use verbosity |
| Does it use negative-only instructions ("never X")? | Add positive alternative ("do Y instead") |
| Does it embed example outputs with values that would change? | Replace concrete values with placeholder tokens (<VALUE>) |
| Does it rely on temperature > 0 for variety? | Use n=K sampling (10.2c) |
| Is the system prompt > 2k tokens? | Split into role-card (system) + procedure (developer) |
| Does output ordering matter? | Add the literal-execution header (10.2e) |

10.4 Score before you ship

Don't approve a rewritten prompt by eyeballing one example. Score it:

- Format compliance rate. Percentage of N=50 outputs that pass your existing downstream parser / JSON schema validation.
- Token cost delta. Cap regression at +20% versus the legacy baseline. Beyond that, dial verbosity="low" or tighten the prompt.
- Latency p50 / p95 delta. Reasoning models add tail latency. If your SLA is tight, set reasoning_effort="low" for the path or move it to a background queue.

A prompt that regresses on any of those by more than your tolerance window ships behind a feature flag with rollback wired in.

11. Testing strategy

Two test layers catch >90% of regressions:

Family-classification tests

    import pytest
    from model_compat import get_model_family, build_openai_chat_kwargs

    @pytest.mark.parametrize("name,expected", [
        ("gpt-5.1", "reasoning"),
        ("gpt5", "reasoning"),
        ("gpt-5-prod-eu", "reasoning"),
        ("o3-mini", "reasoning"),
        ("o1", "reasoning"),
        ("gpt-4o", "legacy"),
        ("gpt-4", "legacy"),
        ("gpt-4-32k", "legacy"),
        ("gpt-35-turbo", "legacy"),
        ("", "legacy"),  # unknown -> fail closed to legacy
        (None, "legacy"),
    ])
    def test_family(name, expected):
        assert get_model_family(name) == expected

    def test_kwargs_for_reasoning_drops_temperature():
        kw = build_openai_chat_kwargs(
            model="gpt-5.1",
            max_tokens=1000,
            temperature=0.2,
            top_p=0.9,
            reasoning_effort="low",
        )
        assert "temperature" not in kw
        assert "top_p" not in kw
        assert kw["max_completion_tokens"] >= 4096  # floor applied
        assert kw["reasoning_effort"] == "low"

    def test_kwargs_for_legacy_keeps_temperature():
        kw = build_openai_chat_kwargs(
            model="gpt-4o",
            max_tokens=1000,
            temperature=0.2,
            top_p=0.9,
        )
        assert kw["max_tokens"] == 1000
        assert kw["temperature"] == 0.2
        assert kw["top_p"] == 0.9
        assert "reasoning_effort" not in kw

Wire-level smoke tests

For each LLM call site you maintain, write a single integration test that exercises the chain against a real (or mocked) endpoint and asserts:

- HTTP 200,
- non-empty content,
- finish_reason != "length" (so you catch silent truncation),
- (optional) classifier-style assertions against a golden output.

Run those tests once against the legacy deployment and once against the new one - same test code, two OPENAI_ENGINE values.

12. Things that don't change

It's easy to over-correct. Several pieces of plumbing keep working without modification:

- Authentication. AAD token providers, managed identity, and API keys are unchanged.
- Embeddings. text-embedding-3-small, text-embedding-3-large, and text-embedding-ada-002 are not part of the reasoning generation; the embeddings call shape is identical.
- Function calling / tool use. Same JSON schema, same response shape.
- Streaming. SSE format is unchanged.
- Token counters. tiktoken still works, but bump to 0.8.0+ so the new model name resolves to the right encoding instead of silently falling back to cl100k_base.

13. Next steps

If you only do four things from this post, do these - in order:

1. Deploy a GPT-5.1 model side-by-side with your current GPT-4 deployment in Microsoft Foundry. Keep the GPT-4 deployment live; you'll need both for the parallel-run period.
2. Drop model_compat.py and langchain_compat.py into your project (Sections 4 and 5). Replace every AzureChatOpenAI(...) construction with ReasoningSafeAzureChatOpenAI and route every kwargs literal through the builders.
3. Run the prompt-audit harness (Section 10.1) against your top 50 most frequently invoked prompts. Triage the diff with the checklist in 10.3.
4. Roll out behind a percentage-based flag. Start at 5% of traffic for 24 hours, compare quality and cost telemetry against the GPT-4o baseline, then ramp.

Reference material

- Azure OpenAI in Microsoft Foundry - model overview
- Azure OpenAI model retirements and deprecations
- Reasoning models in Azure OpenAI
- Structured Outputs in Azure OpenAI
- openai-python SDK changelog
- langchain-openai release notes

Talk to us

- Open an issue on the Microsoft Foundry GitHub samples repository if you hit a gap this post didn't cover.
- Share your migration story or numbers in the comments below - field data is the fastest way to make this guide better for the next team.
- If you operate a regulated workload (finance, health, public sector) and need help sequencing the rollout with your model retirement deadlines, reach out to your Microsoft account team or a Microsoft Foundry partner.

GPT-5.x is the first major model bump in two years that requires code changes - but the changes collapse into one small compatibility module and a one-line LangChain subclass. With those in place your code is forwards-compatible (works on reasoning models today) and backwards-compatible (still works on every GPT-4 deployment you haven't migrated yet). The investment pays a recurring dividend: when the next reasoning bump ships, the only file that needs updating is model_compat.py.

Appendix A - Minimal .env template

    # Endpoint and auth (unchanged between families)
    AZURE_OPENAI_ENDPOINT=https://<resource>.openai.azure.com
    AZURE_OPENAI_API_KEY=<key>

    # The deployment name decides the family. The classifier reads it.
    OPENAI_ENGINE=gpt-5.1
    OPENAI_API_VERSION=2025-03-01-preview

    # Optional override for opaque deployment names
    # OPENAI_MODEL_FAMILY=reasoning   # or "legacy"

    # Optional reasoning controls (ignored for legacy deployments)
    OPENAI_REASONING_EFFORT=medium
    OPENAI_VERBOSITY=medium
    OPENAI_REASONING_TOKEN_SCALE=2.5
    OPENAI_REASONING_TOKEN_FLOOR=4096

    # Flip when your LangChain templates support it
    # OPENAI_USE_DEVELOPER_ROLE=1

Appendix B - One-liner sanity checks

    # Does a deployment name classify correctly?
    python -c "from model_compat import get_model_family; print(get_model_family('gpt-5.1'))"
    # -> reasoning

    # Does the LangChain LLM strip ``stop`` when the deployment is GPT-5.1?
    python -c "
    from langchain_compat import ReasoningSafeAzureChatOpenAI
    import inspect; print(inspect.getsource(ReasoningSafeAzureChatOpenAI._generate))
    "

Companion repository: drop model_compat.py and langchain_compat.py next to each other in your utils/ package. They are zero-dependency on import, so you can vendor them into any service - web, function, batch job - without dragging Azure SDK or LangChain into module-load.
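The two token knobs in Appendix A are easier to reason about with numbers. A minimal sketch, assuming the builder multiplies the caller's max_tokens by OPENAI_REASONING_TOKEN_SCALE and then raises the result to OPENAI_REASONING_TOKEN_FLOOR; the real model_compat.py may combine them differently, and reasoning_token_budget is a hypothetical name used only for illustration:

```python
import math
import os
from typing import Mapping, Optional


def reasoning_token_budget(max_tokens: int,
                           env: Optional[Mapping[str, str]] = None) -> int:
    """Estimate max_completion_tokens for a reasoning deployment.

    Assumption: scale first, then apply the floor. Defaults mirror the
    values shown in the Appendix A template.
    """
    env = os.environ if env is None else env
    scale = float(env.get("OPENAI_REASONING_TOKEN_SCALE", "2.5"))
    floor = int(env.get("OPENAI_REASONING_TOKEN_FLOOR", "4096"))
    return max(floor, math.ceil(max_tokens * scale))


if __name__ == "__main__":
    # A legacy budget of 1000 tokens is lifted to the floor; 4096 is scaled.
    print(reasoning_token_budget(1000, {}))
    print(reasoning_token_budget(4096, {}))
```

Under these assumptions a call site that asked for 1000 tokens ends up at the floor, which is consistent with the `>= 4096` assertion in the Section 11 tests.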
__sourav_sahu__
Jul 16, 2026 · Microsoft Foundry Blog
Data Visualisation / Charting in Azure Foundry
Hi Foundry community, We are working on an agent that can query internal data sources, and are looking for ways to visualise data (think pie charts, bar charts, etc.). This would be consumed by end users through Copilot/Teams. However, we are unable to find a way to do so, which is surprising given that you can easily create charts through M365 Copilot Chat and through Copilot Studio. We have tried using the 'Code Interpreter' tool, but the Teams/Copilot client UIs do not render the results inline, either as an interactive chart or as an embedded image. They also do not give any option to download them. Has anyone tackled this before? How have you been able to generate charts? Many thanks!
jherbert44
Jul 05, 2026 Place Microsoft Foundry Discussions
From RAG to agents: Build AI pipelines inside Azure HorizonDB
By Abe Omorogbe, Navya Teja Gajula, Binnur Gorer, B Harsha Kashyap, Krishnakumar Ravi (KK) from Microsoft PostgreSQL AI team

If you've ever shipped a RAG app, this will feel familiar. Your data lives in Postgres. But the pipeline that turns that data into vectors lives somewhere else, spread across external services, queues, and retry logic. And when the embedding API hiccups mid-batch? That's a 2 a.m. production incident. You didn't set out to build your own embedding service. You just wanted to search your documents.

And RAG is only the beginning. The moment AI works on your data (extraction, summarization, reranking, keeping embeddings fresh, or powering agents), you're back to stitching together more services, queues, and glue code, all outside the database.

AI pipelines in Azure HorizonDB (Preview) remove that entire stack. Define your workflow steps, such as chunking, embedding, extracting, and generating, in SQL, and HorizonDB runs them as AI pipelines next to your data. No orchestrator. No glue code. Just Postgres.

In this post we'll cover:

- The external-orchestrator issue that every AI-on-Postgres team eventually hits
- What AI pipelines are, and the four-part anatomy that makes them click
- Use cases worth trying: semantic search, knowledge extraction, content generation, smarter reranking, and always-fresh embeddings
- How to watch your pipelines run as live graphs in VS Code
- How to spin up HorizonDB and run your first pipeline today

🚀 Try it on Azure HorizonDB. AI pipelines are built into Microsoft's new PostgreSQL cloud service, with no extra infrastructure to stand up. Write ai.create_pipeline(...), call ai.run(...), and it runs. Get started in HorizonDB →

AI preprocessing runs outside the database, far from your data

The standard way to get data into a vector store looks reasonable on a whiteboard: a service reads source rows, calls an embedding API, and writes chunks back to Postgres. In production, though, two failure modes show up again and again:

- The embedding API fails mid-batch, and there's no shared checkpoint showing which rows were completed. You rerun the job, and the extra API calls increase cost.
- A worker crashes after writing chunks but before flipping the parent row's processed flag. Now your embeddings are quietly inconsistent, and nobody knows.

Each of these comes down to the same missing primitive: durable, checkpointed execution that lives where your data lives. External orchestrators can do it, but now you're operating a second service just to feed the first one.

AI pipelines move that logic into HorizonDB itself. The source, the steps, the sink, and the full run history are all SQL, protected by the same transactions, backups, and point-in-time restore your data already has. The database is already where your data commits. It's a natural place for the pipeline to live too.

Anatomy of an AI pipeline in HorizonDB

A pipeline has four parts:

- Source: where rows come from. A table_source(...) over a HorizonDB table, optionally with an incremental_column so the pipeline skips rows it already processed.
- Steps: the AI operations that transform each row, in order. Each step appends columns to the in-flight batch.
- Sink: where results land, ready for use by your AI apps or agents.
- Trigger: 'on_change' (run automatically when source rows change) or 'manual' (run only when you call ai.run()).

Those four parts give the pipeline its shape. The steps are where you define the AI work itself, using composable building blocks:

| Step | What it does |
|---|---|
| ai.chunk() | Split long text into overlapping chunks |
| ai.embed() | Generate vector embeddings |
| ai.extract() | Pull structured fields out of text with an LLM |
| ai.generate() | Generate text from a prompt (e.g. content generation, classification, summarization, and more) |
| ai.rank() | Score documents against a query |

How the pieces fit together. The ai.* API gives you the AI pipeline shape: sources define where data comes from, steps define the AI work to perform, sinks define where results land, and triggers define when the pipeline runs. Under the covers, HorizonDB turns that definition into a durable execution graph, where each step can be checkpointed, retried, and resumed if something fails.

Built on open source. That durability isn't magic: every AI pipeline compiles down to a graph that runs on pg_durable, Microsoft's open-source durable-execution engine for PostgreSQL (built on the duroxide Rust runtime). The ai.* API is the AI-shaped surface (sources, steps, sinks, triggers) and pg_durable is the general-purpose engine underneath that handles checkpointing, retries, and crash recovery. So your pipelines stand on a transparent, inspectable foundation that you can read and run on any Postgres 17 or 18. No black box, no lock-in.

Use case 1: Semantic search over your data

This is one of the most popular use cases. Turn a table of documents into searchable vectors, durably, and keep them fresh as the data changes. That last part matters: in production, documents are edited, added, and deleted constantly, and every change needs the right chunks and embeddings updated without reprocessing the entire corpus or leaving stale vectors behind. With AI pipelines, HorizonDB can track those incremental updates for you. Chunk the body, embed each chunk, and land the result in a DiskANN-indexed table.

    -- Define the pipeline: source -> chunk -> embed -> sink.
    SELECT ai.create_pipeline(
        name    => 'rag_pipeline',
        source  => ai.table_source(table_name => 'documents'),
        steps   => ARRAY[
            ai.chunk(input => 'content', chunk_size => 512, overlap => 64),
            ai.embed(model => 'default-embedding', input => 'chunk_text', dimensions => 1536)
        ],
        trigger => 'on_change',  -- re-embed automatically as rows change
        sink    => ai.table_sink('rag_pipeline_output')
    );

    -- Run it
    SELECT ai.run('rag_pipeline');

    -- Search your data
    SELECT chunk_text,
           embedding <=> azure_openai.create_embeddings('text-embedding-3-small', 'how does vector search work?')::vector AS distance
    FROM rag_pipeline_output
    ORDER BY distance
    LIMIT 3;

📘 Read more details in the AI Pipelines documentation

That's the entire ingestion layer: chunking, embedding, checkpointing, retries, and sink writes in one definition. Because trigger => 'on_change', the pipeline updates embeddings whenever source rows change, processing only what is new or modified instead of redoing the whole corpus. Your vectors stay in sync with your data, and your ingestion work stays efficient as the dataset grows. Point a query at the DiskANN index and you've got production semantic search without a single line of application glue.

That's the whole loop: define, run, inspect. The embedding service you were about to build, with its queue, workers, retry logic, and checkpoint table, never has to exist, and neither does the 2 a.m. production incident.

Why it's better than an external service: a failure in ai.embed() never re-runs ai.chunk(), because each step is a durable node. If the database restarts mid-run, it resumes from the last checkpointed batch, not row zero.

Use case 2: Turn unstructured text into structured metadata

Support tickets, contracts, product reviews, and research papers are full of structure that's locked inside unstructured text. ai.extract() pulls named fields out of text and merges them into the metadata JSONB column, so you can filter and aggregate on things an LLM read for you.
    SELECT ai.create_pipeline(
        name   => 'extraction_pipeline',
        source => ai.table_source(table_name => 'documents'),
        steps  => ARRAY[
            ai.chunk(input => 'content'),
            ai.extract(
                input => 'chunk_text',
                data  => ARRAY['topics: string - the main topics discussed',
                               'entities: string - named people, products, or places'],
                model => 'my-gpt'  -- optional; the default model is used when AI model management is active
            )
        ],
        sink   => ai.table_sink('extraction_pipeline_output')
    );

    SELECT ai.run('extraction_pipeline');

    -- Now query the structured fields the LLM extracted:
    SELECT doc_id,
           metadata->'topics'   AS topics,
           metadata->'entities' AS entities
    FROM extraction_pipeline_output;

📘 Read more details in the AI Pipelines documentation

You describe each field in the ai.extract step as a label: either a bare name like product, or the detailed form name: type - description (for example `sentiment: number - sentiment score from 1 to 5`). HorizonDB does the rest durably, in bulk, with the same retry-and-resume guarantees.

Use case 3: Summarize and rewrite content at scale

ai.generate() runs an LLM prompt against every row, which suits bulk summarization, classification, tone normalization, or title generation. Because it's a pipeline, "summarize 4 million documents" becomes a job that survives restarts instead of a script you have to monitor overnight.

    SELECT ai.create_pipeline(
        name   => 'summary_pipeline',
        source => ai.table_source(table_name => 'documents'),
        steps  => ARRAY[
            ai.chunk(input => 'content'),
            ai.generate(
                input         => 'chunk_text',
                system_prompt => 'Create a concise summary in 50 words or fewer.',
                model         => 'my-gpt'  -- optional; the default model is used when AI model management is active
            )
        ],
        sink   => ai.table_sink('generation_pipeline_output')
    );

    SELECT ai.run('summary_pipeline');

    -- Now query the generated text:
    SELECT doc_id, left(generated_text, 100) AS summary_preview
    FROM generation_pipeline_output
    WHERE generated_text IS NOT NULL
    LIMIT 5;

📘 Read more details in the AI Pipelines documentation

Swap the system_prompt and the same shape becomes a classifier ("Label this ticket as billing, bug, or feature request"), a translator, or a headline generator. The instruction goes in system_prompt; the result lands in generated_text.

Use case 4: Keep embeddings fresh, and re-embed cleanly when the model changes

This is where AI pipelines become especially useful. In a real AI app, two things change constantly: your data and your model. AI pipelines are designed to handle both changes directly.

Your data changes. Set incremental_column and an on_change trigger, and the pipeline embeds only new or changed rows, automatically, until you pause or drop it.

    SELECT ai.create_pipeline(
        name    => 'rag_pipeline',
        source  => ai.table_source(
            table_name         => 'documents',
            incremental_column => 'updated_at'  -- only process what changed
        ),
        steps   => ARRAY[
            ai.chunk(input => 'content'),
            ai.embed(model => 'default-embedding', input => 'chunk_text', dimensions => 1536)
        ],
        trigger => 'on_change',
        sink    => ai.table_sink('rag_pipeline_output')
    );

Your model changes. Bump the model or the dimensions, then run a single, resumable backfill, with no migration script and no babysitting:

    TRUNCATE rag_pipeline_output;
    SELECT ai.backfill('rag_pipeline');

📘 Read more details in the AI Pipelines documentation

The backfill runs as one durable instance. If the database restarts mid-backfill, it picks up from the last checkpointed batch instead of starting over. The painful "re-embed everything" migration becomes a one-liner you can actually trust.
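To make "only process what changed" and "resume from the last checkpointed batch" concrete, here is a toy model of the idea in Python. It is a sketch under stated assumptions, not how HorizonDB or pg_durable is implemented: IncrementalRunner is a hypothetical class, the watermark stands in for the incremental_column bookkeeping, and it assumes updated_at values are unique so a strict comparison is safe.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class IncrementalRunner:
    """Toy watermark-plus-checkpoint loop (illustration only)."""
    batch_size: int = 2
    watermark: int = 0                      # highest updated_at already committed
    processed: List[int] = field(default_factory=list)

    def run(self, rows: List[Dict[str, int]],
            fail_after_batches: Optional[int] = None) -> None:
        # Only rows newer than the watermark are considered at all.
        pending = sorted(
            (r for r in rows if r["updated_at"] > self.watermark),
            key=lambda r: r["updated_at"],
        )
        for batch_no, start in enumerate(range(0, len(pending), self.batch_size)):
            if fail_after_batches is not None and batch_no >= fail_after_batches:
                raise RuntimeError("simulated crash")
            batch = pending[start:start + self.batch_size]
            self.processed.extend(r["id"] for r in batch)   # "embed" the batch
            self.watermark = batch[-1]["updated_at"]         # checkpoint per batch


if __name__ == "__main__":
    rows = [{"id": i, "updated_at": i} for i in range(1, 6)]
    runner = IncrementalRunner()
    try:
        runner.run(rows, fail_after_batches=1)   # crash after the first batch
    except RuntimeError:
        pass
    runner.run(rows)                             # resumes after the checkpoint
    print(runner.processed)
```

The second run skips the rows committed before the crash, and a later edit that bumps a row's updated_at makes only that row eligible again, which is the behaviour the incremental_column option describes.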
Watch your pipelines run as live graphs in VS Code

A pipeline you can see is a pipeline you can trust. Install the PostgreSQL extension for VS Code, connect to HorizonDB, then right-click your database and open Pipelines & Workflows → AI Pipelines. Select any run and the center pane renders the execution as a color-coded graph:

- Blue 🔵: source and sink (where data enters and exits)
- Green 🟢: processing steps (chunk, embed, extract, generate, rank)
- Pink 🟣: external model and service calls

For each run you can read the status (completed, running, failed), the run ID for traceability, the start time and duration for performance, and a link back to the pipeline definition. When a run fails, open the graph and jump straight to the step where execution stopped, with no log spelunking.

Get Started: Try It Now

We have a few demos of AI pipelines in action:

Resource: Link
- Microsoft Build AI Pipeline Demo: Simplify app dev with cloud-native PostgreSQL in Azure HorizonDB | DEM364
- Microsoft Build AI Pipeline GitHub: AI Pipelines Demo GitHub Repo | DEM364
- Microsoft Mechanic Demo: AI Pipeline Demo on Microsoft Mechanic
- Documentation: AI pipelines on HorizonDB

Enabling AI pipelines takes minutes: enable the azure_ai, pg_durable, vector, and pg_diskann extensions and you can get started.

-- On Azure HorizonDB, the extensions are built in.
CREATE EXTENSION IF NOT EXISTS pg_durable;
CREATE EXTENSION IF NOT EXISTS azure_ai;
CREATE EXTENSION IF NOT EXISTS vector;
CREATE EXTENSION IF NOT EXISTS pg_diskann;

That's it: your PostgreSQL database can now run AI pipelines.

Learn more

- MS Learn AI pipelines on HorizonDB: Azure HorizonDB Preview
- pg_durable on GitHub (open source)
- MS Learn Durable Functions on HorizonDB
- Scalable vector search with DiskANN
- PostgreSQL extension for VS Code
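As a quick follow-up to the CREATE EXTENSION statements above, you can check what is actually installed in the current database. This sketch uses only the standard pg_extension catalog and the four extension names from the post; it makes no assumption about which versions HorizonDB ships:

```sql
-- One row is returned per installed extension; a name missing from the
-- result means its CREATE EXTENSION has not been run in this database.
SELECT extname, extversion
FROM pg_extension
WHERE extname IN ('pg_durable', 'azure_ai', 'vector', 'pg_diskann')
ORDER BY extname;
```

If fewer than four rows come back, rerun the matching CREATE EXTENSION IF NOT EXISTS statement before creating a pipeline.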
abeomor-msft
Jul 02, 2026 · Microsoft Blog for PostgreSQL
297 Views
1 like
0 Comments
Foundry Agent deployed to Copilot/Teams Can't Display Images Generated via Code Interpreter
Hello everyone, I’ve been developing an agent in the new Microsoft Foundry and enabled the Code Interpreter tool for it. In Agent Playground, I can successfully start a new chat and have the agent generate a chart/image using Code Interpreter. This works as expected in both the old and new Foundry experiences. However, after publishing the agent to Copilot/Teams for my organization, the same prompt that works in Agent Playground does not function properly. The agent appears to execute the code, but the image is not accessible in Teams. When reviewing the agent traces (via the Traces tab in Foundry), I can see that the agent generates a link to the image in the Code Interpreter sandbox environment, for example: `[Download the bar chart](sandbox:/mnt/data/bar_chart.png)` This works correctly within Foundry, but the sandbox path is not accessible from Teams, so the link fails there. Is there an officially supported way to surface Code Interpreter–generated files/images when the agent is deployed to Copilot/Teams, or is the recommended approach perhaps to implement a custom tool that uploads generated files to an external storage location (e.g., SharePoint, Blob Storage, or another file hosting service) and returns a publicly accessible link instead? I've been having trouble finding anything about this online. Any guidance would be greatly appreciated. Thank you!
jasonpriebe
Jul 01, 2026 · Microsoft Foundry Discussions
267 Views
1 like
1 Comment
FPGA vs ASIC for AI at the Edge: What factors influence your hardware choice?
As AI continues to move closer to edge devices, choosing the right hardware platform has become an important design decision. While both FPGAs and ASICs have their strengths, the best choice often depends on the application's requirements.

Here are some of the key factors that engineering teams typically evaluate:

- Performance and latency requirements
- Power efficiency
- Development cost and NRE
- Time-to-market
- Production volume
- Need for future hardware updates

FPGAs offer flexibility for rapid prototyping and evolving workloads, making them well-suited for early-stage development. ASICs, on the other hand, can provide significant advantages in performance, power consumption, and cost efficiency for high-volume production.

I recently came across a technical article that explains these trade-offs in a structured way and found it useful as a reference: https://www.signoffsemiconductors.com/asic-vs-fpga/

I'd be interested to hear how others approach this decision. Have you migrated a design from FPGA to ASIC? What factors influenced your choice? Are there workloads where you would always choose one over the other?
Venkatesh007
Jul 01, 2026 · Azure
64 Views
0 likes
1 Comment
Weird problem when comparing the answers from chat playground and answer from api
I'm running into a weird issue with Azure AI Foundry (gpt-4o-mini) and need help. I'm building a chatbot that classifies each user message into:

- follow-up to previous message
- repeat of an earlier message
- brand-new query

The classification logic works perfectly in the Azure AI Foundry Chat Playground. But when I use the exact same prompt in Python via:

- AzureChatOpenAI() (LangChain), or
- the official Azure OpenAI code from "View Code" (client.chat.completions.create())

…I get totally different and often wrong results.

I’ve already verified:

- same deployment name (gpt-4o-mini)
- same temperature / top_p / max_tokens
- same system and user messages
- even tried copy-pasting the full system prompt from the Playground

But the API version still behaves very differently. It feels like Azure AI Foundry’s Chat Playground is using some kind of hidden system prompt, invisible scaffolding, or extra formatting that is NOT shown in the UI and NOT included in the “View Code” snippet. The Playground output is consistently more accurate than the raw API call.

Question: Does the Chat Playground apply hidden instructions or pre-processing that we can’t see? And is there any way to:

- view those hidden prompts, or
- replicate Playground behavior exactly through the API or LangChain?

If anyone has run into this or knows how to get identical behavior outside the Playground, I’d really appreciate the help.
Rakanid
Jun 29, 2026 · Azure
235 Views
0 likes
2 Comments