ai agents
133 TopicsIntroducing Durable Functions in PostgreSQL
By Abe Omorogbe, Senior PM | Pino De Candia, Principal Software Engineer | TJ Green, Principal Software Engineer Postgres will happily store your data, run your queries, and scale with you for years. But the moment you need to do more with that data, such as running multi-step transformation, scheduling nightly rollups, generating embeddings or waiting on an approval, you hit a wall. Postgres has no built-in way to run long-lived, fault-tolerant work. That's why we built pg_durable, a new open-source PostgreSQL extension that brings durable execution directly into the database. With pg_durable, Postgres doesn’t just store your data, it runs long-lived, fault-tolerant workflows on it, with built-in retries, parallelism, scheduling, and recovery. Instead of stitching together PL/pgSQL functions or building external orchestration systems, you can now define and run resilient workflows entirely in your database, backed by Postgres' durability and high availability. And on Azure HorizonDB, pg_durable also powers AI pipelines, enabling production-ready data and AI workflows, end-to-end, right inside the database. In this post, we'll cover: The hidden trap: blocking background work What pg_durable is and the DSL that drives it How this engine powers AI pipelines on HorizonDB Sample patterns worth exploring Getting started on HorizonDB, on your laptop, and in VS Code 🚀 Want to try it out? pg_durable ships in Azure HorizonDB, Microsoft's new PostgreSQL cloud service. The HorizonDB Preview is the fastest way to try pg_durable and AI pipelines together. Get started in HorizonDB → pg_durable visualization The hidden trap: blocking background work Most Postgres teams eventually reach a point where they need to run critical tasks on their data: transformations, nightly aggregations, database maintenance workflows, embedding jobs, or multi-step business processes. So, they do the natural thing and try to keep that work inside Postgres. They end up on a journey of increasing complexity and maintenance burden. First, just run the task as a function in your database You cram the whole workflow into one PL/pgSQL function: loop, transform, call APIs, write results, return. It looks simple until you have to run it in production. One connection stays tied up the whole time. Everything runs inside one big transaction, with long locks and no visibility into partial progress. If the connection drops or the database restarts, the whole run is gone. No per-step retries. No parallelism. No scheduling. No clean way to pause for human input. When it fails, you move it outside You push the workflow into an external service: a job queue, polling workers, state tables, step coordination, retry logic, crash-recovery sweeps, and cleanup jobs. What started as a few background tasks turns into a full distributed system. Before you’ve even touched the business logic, you’re building and operating infrastructure just to coordinate work that’s still fundamentally tied to your data. Both paths are workarounds for the same missing primitive: durable, asynchronous background work that lives where your data lives. That's the gap pg_durable fills. What pg_durable actually is pg_durable is a Postgres extension that consists of a DSL (Domain specific language) and the duroxide runtime hosted in a Postgres background worker. You describe a workflow as a small SQL expression, call df.start(...), and get an instance ID back immediately. The work runs off to the side in a background worker, so it never blocks your connection or transaction, and you can check progress later with df.status() and df.result(). The execution state lives in Postgres, which means it benefits from the database’s durability, HA, backups, and recovery. Additionally, the workflow definition does not have to live in the database: your application can send it to df.start(...) over a regular Postgres connection. 2: pg_durable orchestration of worker and schema Because execution is asynchronous, pg_durable automatically breaks a workflow into discrete steps. Each step runs in its own session and transaction, commits its progress, and hands off to the next instead of keeping one giant transaction open. Steps are checkpointed in Postgres and recovered by deterministic replay, so workflows survive crashes, restarts, and failovers and resume where they left off. If a step fails, only that step retries. The whole thing is expressed through a tiny DSL of composable operators: Operator Meaning ~> Sequential. run this, then that & Parallel. fan out, wait for all | Race. fan out, take the first to finish ?> / !> Conditional. if / else @> Loop. repeat durably, survive restarts |=> Capture a step's result into a variable (reuse with $) Advanced Functions df.if() Conditional branch df.loop() Repeat statements df.join() Execute in parallel, wait for all df.http() To call an allowlisted endpoint df.wait_for_schedule() For cron-style timing df.wait_for_signal() Pause for an external event Read more about all operators and functions in pg_durable Without pg_durable vs. with pg_durable The hand-rolled version of "run three aggregations in parallel, then refresh a dashboard with retries and crash recovery" usually means 300+ lines of queue tables, polling workers, state-machine rows, per-step retry logic, crash-recovery sweeps, and cleanup jobs. Plus, the runbook to operate it. The pg_durable version: SELECT df.start( 'SELECT count(*) FROM users' & 'SELECT count(*) FROM orders' & 'SELECT sum(amount) FROM orders' ~> 'REFRESH MATERIALIZED VIEW metrics', 'refresh-dashboard' ); You write the SQL. pg_durable owns the queue, the state, the coordination, the retries, and the crash recovery. Two ways to use pg_durable 1: Use pg_durable directly (works on Azure HorizonDB or any Postgres 17) Enable it and start orchestrating: CREATE EXTENSION pg_durable; SELECT df.start($$ SELECT 'Hello, durable world!' AS message $$); -- returns an instance ID immediately; the worker runs it asynchronously From there you compose: sequential pipelines, conditional branches, races for timeout-or-result, variable passing between steps, human-in-the-loop approvals, scheduled maintenance all in SQL, close to the data, with no new infrastructure. This is the "just use Postgres" answer to a problem teams usually solve by leaving Postgres. Because it's open source under the permissive PostgreSQL License, you can clone the repo and run it on your laptop, your server, or any cloud. 2: AI pipelines (HorizonDB capability) On HorizonDB, pg_durable becomes the foundation for something even more approachable: a managed, declarative AI pipeline surface in the azure_ai extension. pg_durable gives you the durable execution engine, while the ai.* API gives you an AI-shaped model of sources, steps, sinks, and triggers that compile into a durable graph. Traditional app-tier embedding pipelines fail in predictable ways: a transient API error mid-batch with no shared checkpoint, a worker that crashes after writing chunks but before marking the parent row processed, no clean way to re-embed just the rows that changed. Move that logic into HorizonDB and the source, the steps, the sink, and the run history are all SQL, protected by the same transactions, backups, and PITR (point-in-time recovery) your data already has. A complete chunk → embed AI pipeline is one definition: SELECT ai.create_pipeline( name => 'ai_pipeline', source => ai.table_source(table_name => 'documents_ai_pipeline'), steps => ARRAY[ ai.chunk(input => 'content'), ai.embed(model => 'default-embedding', input => 'chunk_text', dimensions => 1536) ], trigger => 'on_change', sink => ai.table_sink('documents_ai_pipeline_output') ); SELECT ai.run('ai_pipeline'); Each AI step becomes a durable node, so if ai.embed() fails, ai.chunk() doesn’t run again. And with trigger => 'on_change', the pipeline runs automatically as rows change, embedding only what’s new. Add a DiskANN index on the resulting table, and you have production-ready vector search end to end, entirely inside the database. Where pg_durable fits and where it doesn't If you've used external orchestrators such as Temporal or Airflow, your first reaction is probably: why would I put control flow in my database? Fair question. pg_durable isn't trying to be a universal orchestrator. Reach for pg_durable when the workflow is tightly coupled to Postgres state. The rows it reads and writes live in the same database, it benefits from the database's own durability, backups, and PITR, and you'd rather not stand up a separate system to coordinate work that never leaves the data tier. Think: embedding pipelines, ETL jobs, scheduled maintenance, and queue-style background jobs. Reach for a dedicated orchestrator when the workflow's center of gravity is outside Postgres, fanning across heterogeneous services, or running arbitrary application logic that does not map cleanly to SQL steps, branching, loops, or HTTP calls. Get started On Azure HorizonDB CREATE EXTENSION IF NOT EXISTS pg_durable; -- Execute a simple SQL query as a durable function SELECT df.start($$ SELECT 'Hello, durable world!' AS message $$); -- Returns: a1b2c3d4 (8-character instance ID) -- Get result of a specific instance SELECT df.result(<ID>); That's it: submit, walk away, inspect. Read the documentation for more details. In VS Code, with the PostgreSQL extension A dense one-liner of ~>, &, and |=> is precise once it clicks, but the learning curve is real so flatten it with tooling. Install the PostgreSQL extension for VS Code from the Marketplace: Connect to HorizonDB or your local Postgres directly from the extension Let Copilot write the SQL. The pg-durable-sql skill turns a plain-English description ("every night, archive orders older than 90 days") into correct pg_durable syntax. Run it and watch it. The extension renders pg_durable workflows and azure_ai pipelines as live graphs, definition and each run, so you can see every step, its timing, and exactly where a failure happened. Authoring, execution, run visualization, and inspection in one window and the same tooling works against any Postgres, not just HorizonDB. On your laptop Prefer to run it yourself? Clone microsoft/pg_durable, use the Codespace prebuild or VS Code Dev Container, and add the extension on any Postgres 17. Sample patterns worth exploring The scenario guide has a full catalog of scenarios; however, these are the three I would start with. ETL Pipeline: a multi-step data transformation where each step must be completed before the next begins. Failures should stop the pipeline. SELECT df.start( 'DELETE FROM target WHERE loaded_at < now() - interval ''7 days''' -- Step 1: Cleanup old ~> 'UPDATE staging SET processed_at = now() WHERE processed_at IS NULL' -- Step 2: Mark staging ~> 'INSERT INTO target (data, source_id) SELECT data, source_id FROM staging WHERE processed_at IS NOT NULL', -- Step 3: Load 'etl-pipeline' -- Label for easy identification ); If the database restarts mid-backfill, it picks up from the last checkpointed batch, not row zero. See full example Scheduled Data Sync: poll an external API or run a job on a schedule (hourly, daily, every 30 minutes). The job should run forever and survive restarts. (See full example): -- Scheduled sync: fetch data every 30 minutes (runs forever) SELECT df.start( @> ( -- @> creates an eternal loop -- Fetch from external API (df.http( 'https://httpbingo.org/json', 'GET' ) |=> 'response') -- Store the response ~> 'INSERT INTO external_data_sync (data) VALUES ($response::jsonb)' -- Wait for next scheduled run ~> df.wait_for_schedule('*/30 * * * *') -- Cron: every 30 minutes ), 'scheduled-data-sync' ); Human-in-the-loop approval: auto-apply routine changes, pause the risky ones until a person signals approval (See full example): SELECT df.start( 'SELECT amount > 10000 AS needs_review FROM invoices WHERE id = 42' |=> 'risky' ?> ( df.wait_for_signal('invoice-42') ~> 'UPDATE invoices SET status = ''paid'' WHERE id = 42' ) !> 'UPDATE invoices SET status = ''paid'' WHERE id = 42', 'invoice-approval' ); The workflow simply waits minutes or days until a reviewer releases it with the matching signal, then resumes. The community is already running with it pg_durable launched as open source and the community is already kicking the tires. The project was a top article on Hacker News on launch day and 1.7K stars on GitHub within its first few days of initial launch. Also Franck Pachot (PostgreSQL community veteran) published an independent walkthrough, Getting Started with pg_durable: durable workflows inside PostgreSQL within days of release. The repo is actively developed, and the maintainers are reading every issue and PR. If you want improvements in our DSL ergonomics, say so. If you want an operator that doesn't exist yet, open an issue. If you've got a scenario we haven't covered, send a PR. The syntax, the docs, and the rough edges all get better when people who run Postgres in production push back. So, clone it, and build something real. If you find rough edges, open an issue or send a PR at microsoft/pg_durable. We think you'll be surprised by how much it can take. Learn more pg_durable on GitHub Durable Functions on HorizonDB AI pipelines on HorizonDB413Views2likes0CommentsCopilot Podcast Creations Stuck in “Creating” State
My Copilot podcast creations are stuck in the “Creating” state and will not delete. The delete button is greyed out. I have already tried closing all browsers, restarting my iPad, phone, and PC, and the issue persists across all devices. It looks like the jobs are stuck in the backend queue and cannot be cleared from my side. I also attempted to open a support chat, but the service request auto‑closed before an agent joined. Is there a way to clear these stuck podcast creations on the backend, or can someone from Microsoft confirm if this is a known issue?11Views0likes0CommentsFile Uploads Not Passed to Custom Engine Agent in Microsoft 365 Copilot Chat
Hi all, I'm working with a custom agent built in Copilot Studio (full authoring experience — topics, knowledge sources, agent flows) published to both the Microsoft Teams channel and the Microsoft 365 channel. I've noticed a significant UX discrepancy when it comes to file and image attachments, and I want to confirm my understanding and check whether any workaround or roadmap item exists. What works: ✅ File/image uploads work as expected in the Copilot Studio test pane ✅ File/image uploads work in Teams chat when interacting with the agent What doesn't work: ❌ File/image uploads do not reach the agent when interacting via the Microsoft 365 Copilot app (both the desktop app and the web experience at microsoft365.com) The UX problem: The M365 Copilot app presents a "+" button in the chat input area with options including "Upload" and "Take screenshot." Users naturally assume these options work. The file even appears as an attachment in the sent message — but the agent never receives it. There's no warning, error, or indication to the user that the attachment was silently dropped. This creates a misleading experience, particularly for end users who have no visibility into the channel behavior differences. What I've found so far: I'm aware this is documented as a known issue for custom engine agents in the Microsoft 365 Copilot Extensibility Known Issues page: "File attachments — Users can't upload files in agent chats and the agent can't return files for download." I also found a related GitHub issue (OfficeDev/microsoft-365-agents-toolkit #15325) where a Microsoft team member confirmed this is a "Copilot platform shortage" — not an Agents Toolkit issue — with no published ETA. My questions for the community and any Microsoft product team members: Is there any currently supported workaround to enable file/image input for a Copilot Studio agent running in the M365 Copilot app (desktop or web)? For example, any manifest configuration, agent settings, or alternate approach? Is this limitation being actively worked on? Is there a roadmap item or Microsoft 365 feature ID that can be tracked for when file attachment support is extended to custom engine agents in the M365 Copilot chat experience? Is the UI behavior (showing upload options that don't work) being addressed separately? Even if full file processing isn't ready, a visible warning or disabled state in the UI would significantly reduce user confusion. Any insight from others who have hit this — or from Microsoft PMs — is appreciated. Happy to share more configuration details if helpful. Thanks! Brian84Views0likes0CommentsCopilot Studio Agent resetting when processing PDF drawings (300MB+) via Claude 4.6 Sonnet
Hello everyone, I am building an automated drawing review verification agent inside Copilot Studio using the Claude 4.6 Sonnet model. The goal of the agent is to read a comments package (20-40MB) and verify if those design comments were successfully incorporated into a milestone drawing set (300MB–400MB). When testing this workflow natively within Claude, the model handles the token load perfectly and returns an accurate compliance/incorporation summary within approximately 20 minutes. However, when running the exact same agent setup within Copilot Studio, the conversational canvas repeatedly crashes and resets the session. I suspect I am hitting the 100-second synchronous conversational timeout or overloading the chat runtime payload limits due to the massive file sizes. Because of corporate compliance policies, this agent must live within our Microsoft tenant so it can be scaled across our operations team via Microsoft 365. How can I fix Copilot Studio to have its performance match Claude's, as it is utilizing the same agent model. I am fairly new to working with AI but am willing explore any avenue as if I can figure out a solution this will help save a lot of time for colleagues. Thanks in advance for any insights!121Views0likes2CommentsPlatform Issue: Agent-Level MCP Initialization Blocks All Topics for Users Without Connection Access
I’m writing to report a significant platform behavior issue we’ve encountered with MCP (Model Context Protocol) server initialization in Copilot Studio that is severely impacting end-user experience. ENVIRONMENT • Platform: Microsoft Copilot Studio • Deployment Channel: Microsoft Teams • Agent Type: Custom Agent (EdiSENSE DEV) • MCP Server: Atlassian Confluence-Jira MCP • Affected Users: Users without Confluence/Jira access entitlements PROBLEM SUMMARY When an MCP server is registered at the agent level in Copilot Studio, it appears to be initialized globally before any topic routing occurs. If a user’s MCP connection is in a Not Connected state — even if their query has absolutely nothing to do with that MCP server - the agent gets stuck in an authentication loop, repeatedly prompting: “Let’s get you connected first. Open connection manager to verify your credentials. Once the connection is ready, retry your request.” The user is shown Retry / Cancel buttons, and the agent never proceeds to route the query to the appropriate topic. STEPS TO REPRODUCE 1. Register an MCP server (e.g., Confluence-Jira) at the agent level in Copilot Studio 2. Deploy the agent to Microsoft Teams 3. Log in as a user who does NOT have access to the MCP-connected service (e.g., no Confluence license) 4. Ask the agent any question completely unrelated to the MCP service 5. Observe: Agent does not route to any topic. Instead, it loops on MCP connection prompt indefinitely EXPECTED BEHAVIOR • MCP initialization failure for a specific service should NOT block unrelated topic routing • The agent should gracefully degrade - if an MCP connection is unavailable for a user, it should skip that MCP and continue routing the query to relevant topics • There should be a way to scope MCP connections to specific topics, or at minimum, mark them as optional/non-blocking ACTUAL BEHAVIOR • Agent-level MCP initialization blocks the entire conversation flow • Users without MCP access cannot use ANY functionality of the agent, even features completely unrelated to the MCP service • There is no graceful fallback or bypass mechanism available to agent builders BUSINESS IMPACT This is a critical gap for enterprise deployments where: • Not all users have access to every integrated service • Agents serve a broad user base with varying entitlements • Admins have no control over MCP initialization order or failure handling In our case, a large portion of our Teams users are completely locked out of the agent’s core functionality simply because they don’t have a Confluence license - even though they never intended to use Confluence-related features. FEATURE REQUEST / SUGGESTED RESOLUTION 1. Allow MCP servers to be scoped at the topic level, not just agent level 2. Introduce an optional flag for MCP connections so initialization failure is non-blocking 3. Provide agent builders with a connection status condition node in the topic flow to handle MCP failures gracefully 4. At minimum, allow the Cancel button in the auth prompt to fall through to normal topic routing I’d appreciate any guidance on whether there is a current workaround, or if this is a known limitation on the roadmap for resolution. Thank you for your time and support.44Views0likes1CommentFoundry IQ: Improve recall by up to 54% with knowledge bases
Foundry IQ: Improve recall by up to 54% with knowledge bases. Foundry IQ (Azure AI Search) has improved its agentic retrieval engine resulting in better answer quality and improved token cost savings. We compared standalone retrieval tools to knowledge bases using the challenging BrowseComp-Plus benchmark and found: Replacing single-shot RAG with a knowledge base improves evidence recall by up to 46%. Combining a smaller agent model with agentic retrieval improves evidence recall by up to 54% while controlling costs and increasing agent responsiveness. In both cases, the amount of retrieval tool calls your agent makes is reduced, resulting in 34% token cost savings.1.3KViews3likes0CommentsMigrating to GPT-5.x Without Breaking GPT-4: A Practical, Backward-Compatible Playbook
The first request your service sends after swapping gpt-4o for gpt-5.1 in production will return HTTP 400. Not in two weeks. On the first call. And the parameter the error points to isn't one you set anywhere in your code - it's bound onto the request by a LangChain helper you've used for two years. This post walks through every breaking change between the GPT-4 and GPT-5 families on Azure OpenAI in Microsoft Foundry, the integration cliffs nobody warns you about, and the small set of files you need so the same call sites work against both model families without branching. Who this is for: engineers maintaining an existing production codebase that calls Azure OpenAI / OpenAI - directly or through LangChain - and needs to onboard GPT-5.x while keeping the GPT-4 deployments alive during rollout. What you'll leave with: one copy-paste compatibility module, a tiny LangChain subclass, a prompt-audit harness, and a 10-step rollout checklist. 1. Why this migration is different Every previous Azure OpenAI bump - 3.5 → 4, 4 → 4o, 4o → 4o-mini - was additive. You changed engine="gpt-4o" and everything kept working. GPT-5.x is the first generation that is subtractive: parameters you used to send now return 400 Unsupported parameter. The wire protocol itself changed because GPT-5 is a reasoning model - it spends tokens thinking internally before it answers, so the parameters that controlled the old sampling pipeline (temperature, top_p, presence_penalty, frequency_penalty) no longer exist on the request schema. What this means for production code: A passing test suite against gpt-4o will fail on the first call against gpt-5.1 with HTTP 400. A passing test suite against gpt-5.1 will fail on every legacy gpt-4* deployment because the new reasoning controls (reasoning_effort, verbosity) are not recognised there. LangChain helpers that worked unmodified for two years (notably create_sql_query_chain) silently bind stop=[...] onto your LLM and trigger the same 400. Source-grep won't find the offending line because it lives inside the library. The good news: the divergence is mechanical. With one detection helper, one parameter-builder, and one tiny LangChain subclass you can run the same code against both families. 2. The breaking-changes matrix Concern GPT-4 / GPT-4o (legacy) GPT-5.x / o1 / o3 (reasoning) Output budget max_tokens max_completion_tokens (rejects max_tokens) temperature 0.0–1.0 Only the default (1) is accepted - omit it top_p Supported Rejected presence_penalty, frequency_penalty Supported Rejected logprobs, logit_bias Supported Rejected stop sequences Supported Rejected on most reasoning deployments reasoning_effort Rejected New: minimal | low | medium | high verbosity Rejected New: low | medium | high (sometimes via extra_body) System instruction role system developer recommended; system still works as alias Output token cost Output tokens only Output + reasoning tokens count against your cap Recommended API version 2024-12-01-preview or earlier 2025-03-01-preview or later Two consequences are easy to miss: max_completion_tokens is a shared budget. GPT-5.1 can burn 2–4× more tokens internally before emitting the first response token. A cap of 4096 that comfortably held a SQL query on GPT-4o now silently truncates the answer mid-token on GPT-5.1. Multiply your legacy budgets by ~2.5× and add a floor (e.g. 4096) before sending. The stop parameter is the silent killer. Any helper that calls llm.bind(stop=[...]) - and there are several in langchain - will turn a working code path into a 400 the moment you swap deployments. 3. Compatibility strategy: detect, don't fork The temptation is to fork: one branch for GPT-4, one for GPT-5. Don't. The right unit of abstraction is one function that classifies the deployment into a family, and one function that builds a kwargs dict the SDK will accept for that family. Every call site - SDK, LangChain, raw HTTP - drains into the same kwargs builder. When you eventually retire GPT-4 you delete the legacy branch in one file, not in fifty. 4. The industry-agnostic compatibility module Drop the following file into your project. It has no Azure / OpenAI / LangChain imports at module load time, so the same file works from a web service, a serverless function, a notebook, or a CLI tool. 4.1 model_compat.py """ Model compatibility helper for GPT-5.x with GPT-4 backward compatibility. This module centralises the parameter translation needed to talk to the "reasoning" generation of OpenAI / Azure OpenAI models (GPT-5, GPT-5.1, o1, o3, o4) while keeping older deployments (gpt-4, gpt-4o, gpt-4-32k, gpt-3.5-turbo, etc.) working unchanged. """ from __future__ import annotations import logging import os import re from typing import Any, Dict, Iterable, Mapping, Optional # --------------------------------------------------------------------------- # Family detection # --------------------------------------------------------------------------- _REASONING_PATTERNS = ( # gpt-5, gpt5, gpt-5.1, gpt_5, GPT 5, gpt5mini-prod-eu, ... re.compile(r"(?i)(^|[^a-z0-9])gpt[-_ ]?5(\.\d+)?([^0-9]|$)"), # o1, o3, o4, o1-mini, o3-preview ... re.compile(r"(?i)(^|[^a-z0-9])o[134](-mini|-preview)?([^a-z0-9]|$)"), ) _LEGACY_PATTERNS = ( re.compile(r"(?i)gpt[-_ ]?4o"), re.compile(r"(?i)gpt[-_ ]?4(?!\d)"), re.compile(r"(?i)gpt[-_ ]?4[-_ ]?32k"), re.compile(r"(?i)gpt[-_ ]?3\.?5"), re.compile(r"(?i)gpt[-_ ]?35"), ) def get_model_family(model_or_deployment: Optional[str]) -> str: """Return ``"reasoning"`` for GPT-5.x / o-series, ``"legacy"`` otherwise. Honours an ``OPENAI_MODEL_FAMILY`` env-var override for deployments whose user-defined name does not embed the model family (e.g. ``prod-default``). """ override = (os.getenv("OPENAI_MODEL_FAMILY") or "").strip().lower() if override in {"reasoning", "gpt-5", "gpt5", "gpt-5.1", "o-series", "o1", "o3"}: return "reasoning" if override in {"legacy", "gpt-4", "gpt4", "gpt-3.5", "gpt35", "chat"}: return "legacy" name = (model_or_deployment or "").strip() if not name: # Fail closed: when we don't know, assume legacy so old code keeps # working. Misclassifying a reasoning deployment as legacy fails fast # with a clear "Unsupported parameter" 400; the reverse silently # drops parameters the caller expected. return "legacy" for pat in _REASONING_PATTERNS: if pat.search(name): return "reasoning" for pat in _LEGACY_PATTERNS: if pat.search(name): return "legacy" return "legacy" def is_reasoning_model(model_or_deployment: Optional[str]) -> bool: return get_model_family(model_or_deployment) == "reasoning" # --------------------------------------------------------------------------- # Reasoning controls # --------------------------------------------------------------------------- _VALID_REASONING_EFFORT = {"minimal", "low", "medium", "high"} _VALID_VERBOSITY = {"low", "medium", "high"} def _coerce_choice(raw: Optional[str], valid: Iterable[str]) -> Optional[str]: if raw is None: return None value = str(raw).strip().lower() if not value: return None if value not in set(valid): logging.warning( "Ignoring unsupported value '%s'; expected one of %s", raw, sorted(valid), ) return None return value def get_reasoning_effort(override: Optional[str] = None) -> Optional[str]: return _coerce_choice( override if override is not None else os.getenv("OPENAI_REASONING_EFFORT"), _VALID_REASONING_EFFORT, ) def get_verbosity(override: Optional[str] = None) -> Optional[str]: return _coerce_choice( override if override is not None else os.getenv("OPENAI_VERBOSITY"), _VALID_VERBOSITY, ) # --------------------------------------------------------------------------- # max_completion_tokens scaling # --------------------------------------------------------------------------- def _reasoning_token_scale() -> float: """Multiplier applied to legacy ``max_tokens`` when targeting a reasoning model.""" try: scale = float(os.getenv("OPENAI_REASONING_TOKEN_SCALE", "2.5")) except (TypeError, ValueError): scale = 2.5 return scale if scale > 0 else 1.0 def _reasoning_token_floor() -> int: try: floor = int(os.getenv("OPENAI_REASONING_TOKEN_FLOOR", "4096")) except (TypeError, ValueError): floor = 4096 return floor if floor > 0 else 4096 def scale_max_tokens_for_reasoning(max_tokens: Optional[int]) -> Optional[int]: """Scale a legacy ``max_tokens`` budget up for reasoning models. ``None`` and ``-1`` ("no explicit cap") are passed through. """ if max_tokens is None: return None if max_tokens == -1: return -1 return max(int(round(max_tokens * _reasoning_token_scale())), _reasoning_token_floor()) # --------------------------------------------------------------------------- # Kwargs builders # --------------------------------------------------------------------------- _SAMPLING_KEYS = ("temperature", "top_p", "presence_penalty", "frequency_penalty") def _drop_none(mapping: Mapping[str, Any]) -> Dict[str, Any]: return {k: v for k, v in mapping.items() if v is not None} def build_openai_chat_kwargs( model: str, *, max_tokens: Optional[int] = None, temperature: Optional[float] = None, top_p: Optional[float] = None, presence_penalty: Optional[float] = None, frequency_penalty: Optional[float] = None, reasoning_effort: Optional[str] = None, verbosity: Optional[str] = None, extra: Optional[Mapping[str, Any]] = None, ) -> Dict[str, Any]: """Build kwargs for ``openai.OpenAI / AzureOpenAI .chat.completions.create``. Splat the result directly: ``client.chat.completions.create(**kwargs)``. Unsupported parameters are silently omitted for reasoning models; legacy deployments retain the historical behaviour. """ family = get_model_family(model) kwargs: Dict[str, Any] = {"model": model} # ---- output budget ---- if max_tokens is not None and max_tokens != -1: if family == "reasoning": kwargs["max_completion_tokens"] = scale_max_tokens_for_reasoning(int(max_tokens)) else: kwargs["max_tokens"] = int(max_tokens) # ---- sampling ---- if family == "legacy": kwargs.update(_drop_none({ "temperature": temperature, "top_p": top_p, "presence_penalty": presence_penalty, "frequency_penalty": frequency_penalty, })) else: for key, value in ( ("temperature", temperature), ("top_p", top_p), ("presence_penalty", presence_penalty), ("frequency_penalty", frequency_penalty), ): if value is not None: logging.debug( "Dropping unsupported parameter '%s' for reasoning model '%s'", key, model, ) # ---- reasoning controls ---- if family == "reasoning": effort = get_reasoning_effort(reasoning_effort) if effort is not None: kwargs["reasoning_effort"] = effort verb = get_verbosity(verbosity) if verb is not None: # ``verbosity`` is not a top-level kwarg in openai-python <= 1.65.x; # route it via ``extra_body`` so it lands in the JSON without a # TypeError from the SDK. kwargs.setdefault("extra_body", {})["verbosity"] = verb # ---- caller-supplied extras (already filtered) ---- if extra: for key, value in extra.items(): if value is None: continue if family == "reasoning" and key in _SAMPLING_KEYS: continue kwargs[key] = value return kwargs def build_langchain_chat_kwargs( deployment_name: str, *, max_tokens: Optional[int] = None, temperature: Optional[float] = None, top_p: Optional[float] = None, reasoning_effort: Optional[str] = None, verbosity: Optional[str] = None, ) -> Dict[str, Any]: """Build kwargs for ``langchain_openai.AzureChatOpenAI`` / ``ChatOpenAI``. Older ``langchain-openai`` releases don't expose ``max_completion_tokens`` as a top-level kwarg, so we forward it through ``model_kwargs`` (which langchain passes straight to the SDK). """ family = get_model_family(deployment_name) kwargs: Dict[str, Any] = {} model_kwargs: Dict[str, Any] = {} if max_tokens is not None and max_tokens != -1: if family == "reasoning": model_kwargs["max_completion_tokens"] = scale_max_tokens_for_reasoning(int(max_tokens)) else: kwargs["max_tokens"] = int(max_tokens) if family == "reasoning": effort = get_reasoning_effort(reasoning_effort) if effort is not None: model_kwargs["reasoning_effort"] = effort verb = get_verbosity(verbosity) if verb is not None: model_kwargs.setdefault("extra_body", {})["verbosity"] = verb else: if temperature is not None: kwargs["temperature"] = temperature if top_p is not None: kwargs["top_p"] = top_p if model_kwargs: kwargs["model_kwargs"] = model_kwargs return kwargs def get_system_role(model_or_deployment: Optional[str] = None) -> str: """Return ``"developer"`` for reasoning models when opted in, ``"system"`` otherwise. Defaulting to ``"system"`` preserves compatibility with LangChain prompt templates and SDK helpers that don't yet recognise the new role. Opt in with ``OPENAI_USE_DEVELOPER_ROLE=1`` once your stack supports it. """ if not is_reasoning_model(model_or_deployment): return "system" raw = os.getenv("OPENAI_USE_DEVELOPER_ROLE", "") return "developer" if raw.strip().lower() in {"1", "true", "yes", "on"} else "system" 4.2 What this buys you Every direct-SDK call collapses to two lines: from openai import AzureOpenAI from model_compat import build_openai_chat_kwargs client = AzureOpenAI( azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"], api_version=os.environ["OPENAI_API_VERSION"], api_key=os.environ["AZURE_OPENAI_API_KEY"], ) kwargs = build_openai_chat_kwargs( model=os.environ["OPENAI_ENGINE"], max_tokens=4096, # automatically becomes max_completion_tokens for GPT-5 temperature=0.2, # automatically dropped for GPT-5 reasoning_effort="low", # automatically dropped for GPT-4 ) response = client.chat.completions.create( messages=[ {"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": user_input}, ], **kwargs, ) The same call site now correctly targets gpt-5.1, gpt-4o, gpt-4-32k, o3-mini, or any future deployment whose name embeds the family - and you can override with the OPENAI_MODEL_FAMILY env var when the deployment alias is opaque. 4.3 Raw HTTP call sites Some legacy code paths bypass the SDK and POST JSON directly. The same builder works there: import json import requests from model_compat import build_openai_chat_kwargs, get_system_role deployment = os.environ["OPENAI_ENGINE"] api_version = os.environ["OPENAI_API_VERSION"] endpoint = ( f"{os.environ['AZURE_OPENAI_ENDPOINT']}/openai/deployments/{deployment}" f"/chat/completions?api-version={api_version}" ) payload = { "messages": [ {"role": get_system_role(deployment), "content": system_prompt}, {"role": "user", "content": user_prompt}, ], } # Splat the kwargs into the payload, then strip the SDK-only ``model`` key. payload.update(build_openai_chat_kwargs( model=deployment, max_tokens=800, temperature=0.7, top_p=0.95, reasoning_effort="low", )) payload.pop("model", None) # ``model`` is encoded in the URL for Azure payload.pop("extra_body", None) # already on the payload root resp = requests.post( endpoint, headers={"Content-Type": "application/json", "api-key": api_key}, data=json.dumps(payload), timeout=60, ) resp.raise_for_status() 5. LangChain: the hidden stop parameter langchain.chains.sql_database.query.create_sql_query_chain calls llm.bind(stop=["\nSQLResult:"]) internally to terminate the model's output before the example block in its prompt. That stop value is forwarded to the SDK on every invocation. GPT-5.1 rejects it: openai.BadRequestError: Error code: 400 - {'error': { 'message': "Unsupported parameter: 'stop' is not supported with this model.", 'type': 'invalid_request_error', 'param': 'stop', }} You can't reach into the chain to disable it. The clean fix is a thin AzureChatOpenAI subclass that drops stop for reasoning models only: 5.1 langchain_compat.py """LangChain-side compatibility shim for reasoning-class deployments.""" from __future__ import annotations from typing import Any, List, Optional from langchain_core.callbacks.manager import ( AsyncCallbackManagerForLLMRun, CallbackManagerForLLMRun, ) from langchain_core.messages import BaseMessage from langchain_core.outputs import ChatResult from langchain_openai import AzureChatOpenAI # use ChatOpenAI for non-Azure from model_compat import is_reasoning_model class ReasoningSafeAzureChatOpenAI(AzureChatOpenAI): """``AzureChatOpenAI`` variant that hides parameters reasoning models reject. Reasoning models (GPT-5.x, o1/o3/o4) return HTTP 400 when a request payload carries ``stop``. LangChain's SQL helpers unconditionally bind it, so the unsupported parameter reaches the SDK regardless of how the caller configured the LLM. This subclass strips ``stop`` for reasoning deployments while forwarding it unchanged for legacy GPT-4 / GPT-3.5 deployments - the behaviour is byte-identical to upstream LangChain for those models. """ def _deployment_id(self) -> str: # ``langchain-openai`` >= 0.2 exposes ``azure_deployment``; older # releases use ``deployment_name``. Either may be set by the caller. return ( getattr(self, "azure_deployment", None) or getattr(self, "deployment_name", None) or "" ) def _generate( self, messages: List[BaseMessage], stop: Optional[List[str]] = None, run_manager: Optional[CallbackManagerForLLMRun] = None, **kwargs: Any, ) -> ChatResult: if is_reasoning_model(self._deployment_id()): stop = None return super()._generate(messages, stop=stop, run_manager=run_manager, **kwargs) async def _agenerate( self, messages: List[BaseMessage], stop: Optional[List[str]] = None, run_manager: Optional[AsyncCallbackManagerForLLMRun] = None, **kwargs: Any, ) -> ChatResult: if is_reasoning_model(self._deployment_id()): stop = None return await super()._agenerate(messages, stop=stop, run_manager=run_manager, **kwargs) Use it as a drop-in replacement: from langchain_compat import ReasoningSafeAzureChatOpenAI from model_compat import build_langchain_chat_kwargs llm_kwargs = build_langchain_chat_kwargs( deployment_name=os.environ["OPENAI_ENGINE"], max_tokens=6000, temperature=0, reasoning_effort="low", ) llm = ReasoningSafeAzureChatOpenAI( azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"], azure_deployment=os.environ["OPENAI_ENGINE"], openai_api_version=os.environ["OPENAI_API_VERSION"], api_key=os.environ["AZURE_OPENAI_API_KEY"], **llm_kwargs, ) That single substitution makes create_sql_query_chain, SQLDatabaseChain, and the ChatOpenAI-based RAG helpers all work against GPT-5.1 without any other changes. 6. The second LangChain gotcha: prose where SQL should be create_sql_query_chain is documented to return the literal string "I don't know" (or a similar fallback) when the LLM cannot form a query. The default code path takes the chain output and runs it against the database: sql = chain.invoke({...}) # -> "I don't know" result = db.run(sql) # -> sends "I don't know" to pyodbc The database faithfully returns: [42000] Unclosed quotation mark after the character string 't know'. (105) Which surfaces to the end user as a misleading "SQL syntax error". The mitigation is a one-line guard that validates the chain output looks like SQL before execution: import re _SQL_START_RE = re.compile( r"^\s*(?:WITH|SELECT|INSERT|UPDATE|DELETE|CREATE|DROP|ALTER|MERGE|EXEC|EXECUTE|TRUNCATE)\b", re.IGNORECASE, ) def looks_like_sql(text: str) -> bool: """True only if ``text`` starts with a recognised SQL DML/DDL keyword.""" if not text or not text.strip(): return False return bool(_SQL_START_RE.match(text)) sql = extract_sql_query(chain.invoke({...})) if not looks_like_sql(sql): logging.warning("SQL chain returned a non-SQL response: %r", sql[:200]) return ( "I couldn't form a SQL query for that question. " "Please rephrase or add more context." ) result = db.run(sql) This isn't specific to GPT-5.1 - it's good hygiene for any LLM that backs a SQL agent - but the failure mode becomes much more frequent on reasoning models because they're better at refusing. 7. Cleaning Markdown out of create_sql_query_chain output Reasoning models like to wrap their answer in a markdown fence and append a "Note:" or "Explanation:" paragraph. None of that survives db.run(). A defensive extract_sql_query handles all the variants: import re def extract_sql_query(text: str) -> str: """Strip markdown fences, leading prose, and trailing explanations.""" # 1) Prefer SQL inside a markdown code fence. m = re.search(r"```(?:sql|SQL|Sql)?\s*\n(.*?)\n```", text, re.DOTALL) if m: text = m.group(1) text = text.strip() # 2) Drop any prose *before* the SQL by jumping to the first SQL keyword. m = re.search( r"(?im)^\s*(WITH|SELECT|INSERT|UPDATE|DELETE|CREATE|DROP|ALTER|MERGE|EXEC|EXECUTE|TRUNCATE)\b", text, ) if m: text = text[m.start(1):] # 3) Cut at the first "Explanation:" / "Note:" / "This query..." marker. m = re.compile( r"(?im)^\s*(?:Explanation|Note|Notes|Here(?:'|\u2019)?s|" r"This\s+(?:query|SQL|statement|returns|counts|selects|will|gets|finds)|" r"The\s+(?:query|SQL|above|result|statement)|" r"Result|Results|Description|Output|Answer)\b[^\n]*" ).search(text) if m: text = text[: m.start()].rstrip() # 4) Drop any trailing fence that survived step 1. if text.endswith("```"): text = text[:-3].rstrip() return text.strip() 8. Package versioning The bare minimum your requirements.txt / environment.yml needs: Package Last GPT-4-only version First GPT-5.x-safe version Notes openai 1.55.x 1.65.x (recommend 1.65.4+) Earlier versions reject max_completion_tokens and reasoning_effort as unknown kwargs langchain-openai 0.2.14 0.3.7+ 0.3.x line exposes azure_deployment and forwards model_kwargs correctly to the new SDK langchain 0.3.14 0.3.21+ Pin together with langchain-openai and langchain-core langchain-core 0.3.29 0.3.49+ Update in lockstep with the others langchain-community 0.3.14 0.3.20+ Mostly transitive; needed for SQLDatabase helpers tiktoken 0.7.x 0.8.0+ Encodings for GPT-5.1 ship in 0.8.0; older versions fall back to cl100k_base for unknown models tokencost (optional) 0.1.16 0.1.20+ Update for GPT-5.x price tables Azure OpenAI API version 2024-12-01-preview 2025-03-01-preview First version that ships reasoning_effort and the GPT-5.x routing Pin exact versions after testing - LangChain has a habit of moving public re-exports between minor releases. requirements.txt snippet: openai==1.65.4 langchain==0.3.21 langchain-core==0.3.49 langchain-openai==0.3.7 langchain-community==0.3.20 tiktoken==0.8.0 9. New GPT-5.x knobs worth using Once you're on a reasoning deployment, two new parameters become available. Both are optional, both default to a sensible value, and both are stripped by the kwargs builder above when the target is a legacy model. reasoning_effort minimal - one-shot lookups, classification. low - deterministic structured output (SQL, JSON-schema extraction, rule-based rewrites). Lowest cost overhead. medium (default) - RAG, summarisation, normal Q&A. high - multi-step analytical reasoning, complex code synthesis. A useful pattern is to choose the level by task profile rather than at the call site: TASK_EFFORT = { "sql": "low", "structured_extract": "low", "kg_cleaning": "low", "rag_qa": "medium", "vision": "medium", "analytical": "high", } verbosity low | medium | high. Controls the length of the response, not its substance. Useful for grounding chat UIs where you want crisp answers - set low for /answer endpoints and high for "explain like a senior engineer" panels. Note: in openai-python <= 1.65.x, verbosity is not yet a top-level keyword argument; pass it through extra_body (the builder above already does this). developer role GPT-5.x prefers {"role": "developer", "content": "..."} for instructions that previously used system. The change is non-breaking on the Azure side - system is still accepted as an alias - but some downstream LangChain prompt templates predate the role and will reject it on construction. Treat developer as opt-in (OPENAI_USE_DEVELOPER_ROLE=1) for now; flip the default after your prompt-template version is known good. 10. Auditing your existing prompts When the wire-level migration is done your service will talk to GPT-5.x - but that doesn't mean it says the right thing. Reasoning models read prompts differently in ways that won't show up as 400s: They take instructions more literally. A prompt that worked when GPT-4o rounded the corners may surface every edge case verbatim. They refuse more often. "I don't know" / "I cannot help with that" are more frequent because reasoning models are less willing to confabulate. They ignore "be concise" / "be terse". Use the new verbosity knob. Step-by-step / chain-of-thought instructions become redundant. The model already reasons internally; extra "think before you answer" prose competes with its own chain of thought and often hurts output quality. Negative-only instructions can backfire. "Never output X" prompts occasionally cause refusals where you'd rather have a workaround. 10.1 Build a prompt regression harness Capture every system+user prompt your service emits in a CSV, then replay each one against both deployments and diff the output. The diff is the single most useful artefact you can produce before the cutover: # prompt_audit.py - minimal differential tester import csv from openai import AzureOpenAI from model_compat import build_openai_chat_kwargs LEGACY = "gpt-4o" REASONING = "gpt-5.1" client = AzureOpenAI( azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"], api_version=os.environ["OPENAI_API_VERSION"], api_key=os.environ["AZURE_OPENAI_API_KEY"], ) def run(model: str, system: str, user: str) -> str: kw = build_openai_chat_kwargs( model=model, max_tokens=4096, temperature=0.2, # auto-dropped for reasoning reasoning_effort="medium", # auto-dropped for legacy ) resp = client.chat.completions.create( messages=[ {"role": "system", "content": system}, {"role": "user", "content": user}, ], **kw, ) return resp.choices[0].message.content or "" with open("prompts.csv") as f_in, open("diff.tsv", "w", newline="") as f_out: writer = csv.writer(f_out, delimiter="\t") writer.writerow(["id", "legacy_first80", "reasoning_first80", "len_legacy", "len_new", "identical"]) for row in csv.DictReader(f_in): legacy = run(LEGACY, row["system"], row["user"]) new = run(REASONING, row["system"], row["user"]) writer.writerow([ row["id"], legacy[:80].replace("\n", " "), new[:80].replace("\n", " "), len(legacy), len(new), legacy.strip() == new.strip(), ]) Capture three signals per prompt - they're enough to triage 95% of drift: Format compliance. Did the output still parse as the expected JSON / YAML / Markdown / SQL? Run your existing downstream parser on both columns. Token cost delta. Reasoning models tend to be more verbose by default. Anything beyond +20% is a candidate for the verbosity="low" knob. Semantic drift. Spot-check 5–10% of rows by hand. You're looking for changes in intent, not changes in wording. 10.2 Common rewrites to make prompts model-agnostic The goal isn't to write two prompts. It's to write one prompt that produces correct output on both families by moving constraints out of the natural-language body and into the request shape. 10.2a. Format constraints belong in response_format, not the prose Don't: Output ONLY a JSON object with keys `name` and `score`. Do not include any explanation. Do not wrap in markdown. Do not say anything else. Do: resp = client.chat.completions.create( messages=[...], response_format={ "type": "json_schema", "json_schema": { "name": "scored_entity", "schema": { "type": "object", "properties": { "name": {"type": "string"}, "score": {"type": "number"}, }, "required": ["name", "score"], "additionalProperties": False, }, "strict": True, }, }, **kw, ) response_format is honoured by both gpt-4o (>= 2024-08-06) and the entire GPT-5.x line. The prompt loses three lines of brittle natural-language constraints and you get schema-validated output for free. 10.2b. Replace "think step by step" with reasoning_effort Don't: Let's think step by step. First identify the entity. Then find the category. Then compute the score. Then format the answer. Do: delete the prose and pass reasoning_effort="medium" (or "high") for reasoning deployments. The kwargs builder drops the parameter automatically for GPT-4 models, so the same prompt now produces: step-by-step reasoning internally on GPT-5.x (lower output token cost), the same final answer on GPT-4o that the verbose prompt used to elicit. 10.2c. Replace temperature-based variety with n sampling If your code relied on temperature=0.9 to get diverse completions, GPT-5.x will return roughly the same answer every time. Generate variety the explicit way: resp = client.chat.completions.create(messages=[...], n=5, **kw) candidates = [c.message.content for c in resp.choices] Or call the model N times with slightly different framings. Both patterns work against either family with no further code changes. 10.2d. Move procedural instructions to the developer role For multi-step workflows, the new developer role gives clearer separation between what the system enforces and what the user is asking: messages = [ {"role": get_system_role(deployment), "content": role_card_for_assistant}, {"role": "developer", "content": procedural_instructions}, {"role": "user", "content": user_question}, ] get_system_role returns "system" for legacy models and "developer" for reasoning models opted in via OPENAI_USE_DEVELOPER_ROLE=1. Once your LangChain templates support the new role you can flip the default. 10.2e. Add a literal-execution header for strict formats For prompts where the exact output shape matters (table generation, SQL with a fixed column order, structured incident reports), prepend an explicit literal-execution header so reasoning models don't drift into "helpful improvements": LITERAL_EXECUTION_HEADER = ( "Execution mode: follow the instructions below literally and in order. " "Do not infer intent, skip, reorder, merge, or add steps. Honour the " "exact formatting, tone, and verbosity specified. If a step is " "ambiguous, respond with the literal interpretation and flag the " "ambiguity instead of guessing." ) def apply_literal_execution(prompt: str) -> str: if LITERAL_EXECUTION_HEADER in prompt: return prompt return f"{LITERAL_EXECUTION_HEADER}\n\n{prompt}" It's a no-op on GPT-4o (the older models already follow instructions literally enough) and a meaningful guard rail on GPT-5.1. Wire it behind an OPENAI_LITERAL_EXECUTION flag so you can disable it without redeploying. 10.3 A prompt-shaped checklist Run every prompt your service emits past these questions: Question Action Does it specify output format in prose? Move to response_format (10.2a) Does it include "think step by step"? Remove; set reasoning_effort (10.2b) Does it set tone constraints ("be concise")? Use verbosity Does it use negative-only instructions ("never X")? Add positive alternative ("do Y instead") Does it embed example outputs with values that would change? Replace concrete values with placeholder tokens (<VALUE>) Does it rely on temperature > 0 for variety? Use n=K sampling (10.2c) Is the system prompt > 2k tokens? Split into role-card (system) + procedure (developer) Does output ordering matter? Add the literal-execution header (10.2e) 10.4 Score before you ship Don't approve a rewritten prompt by eyeballing one example. Score it: Format compliance rate. Percentage of N=50 outputs that pass your existing downstream parser / JSON schema validation. Token cost delta. Cap regression at +20% versus the legacy baseline. Beyond that, dial verbosity="low" or tighten the prompt. Latency p50 / p95 delta. Reasoning models add tail latency. If your SLA is tight, set reasoning_effort="low" for the path or move it to a background queue. A prompt that regresses on any of those by more than your tolerance window ships behind a feature flag with rollback wired in. 11. Testing strategy Two test layers catch >90% of regressions: Family-classification tests import pytest from model_compat import get_model_family, build_openai_chat_kwargs @pytest.mark.parametrize("name,expected", [ ("gpt-5.1", "reasoning"), ("gpt5", "reasoning"), ("gpt-5-prod-eu", "reasoning"), ("o3-mini", "reasoning"), ("o1", "reasoning"), ("gpt-4o", "legacy"), ("gpt-4", "legacy"), ("gpt-4-32k", "legacy"), ("gpt-35-turbo", "legacy"), ("", "legacy"), # unknown -> fail closed to legacy (None, "legacy"), ]) def test_family(name, expected): assert get_model_family(name) == expected def test_kwargs_for_reasoning_drops_temperature(): kw = build_openai_chat_kwargs( model="gpt-5.1", max_tokens=1000, temperature=0.2, top_p=0.9, reasoning_effort="low", ) assert "temperature" not in kw assert "top_p" not in kw assert kw["max_completion_tokens"] >= 4096 # floor applied assert kw["reasoning_effort"] == "low" def test_kwargs_for_legacy_keeps_temperature(): kw = build_openai_chat_kwargs( model="gpt-4o", max_tokens=1000, temperature=0.2, top_p=0.9, ) assert kw["max_tokens"] == 1000 assert kw["temperature"] == 0.2 assert kw["top_p"] == 0.9 assert "reasoning_effort" not in kw Wire-level smoke tests For each LLM call site you maintain, write a single integration test that exercises the chain against a real (or mocked) endpoint and asserts: HTTP 200, non-empty content, finish_reason != "length" (so you catch silent truncation), (optional) classifier-style assertions against a golden output. Run those tests once against the legacy deployment and once against the new one - same test code, two OPENAI_ENGINE values. 12. Things that don't change It's easy to over-correct. Several pieces of plumbing keep working without modification: Authentication. AAD token providers, managed identity, and API keys are unchanged. Embeddings. text-embedding-3-small, text-embedding-3-large, and text-embedding-ada-002 are not part of the reasoning generation; the embeddings call shape is identical. Function calling / tool use. Same JSON schema, same response shape. Streaming. SSE format is unchanged. Token counters. tiktoken still works, but bump to 0.8.0+ so the new model name resolves to the right encoding instead of silently falling back to cl100k_base. 13. Next steps If you only do four things from this post, do these - in order: Deploy a GPT-5.1 model side-by-side with your current GPT-4 deployment in Microsoft Foundry. Keep the GPT-4 deployment live; you'll need both for the parallel-run period. Drop model_compat.py and langchain_compat.py into your project (Sections 4 and 5). Replace every AzureChatOpenAI(...) construction with ReasoningSafeAzureChatOpenAI and route every kwargs literal through the builders. Run the prompt-audit harness (Section 10.1) against your top 50 most frequently invoked prompts. Triage the diff with the checklist in 10.3. Roll out behind a percentage-based flag. Start at 5% of traffic for 24 hours, compare quality and cost telemetry against the GPT-4o baseline, then ramp. Reference material Azure OpenAI in Microsoft Foundry - model overview Azure OpenAI model retirements and deprecations Reasoning models in Azure OpenAI Structured Outputs in Azure OpenAI openai-python SDK changelog langchain-openai release notes Talk to us Open an issue on the Microsoft Foundry GitHub samples repository if you hit a gap this post didn't cover. Share your migration story or numbers in the comments below - field data is the fastest way to make this guide better for the next team. If you operate a regulated workload (finance, health, public sector) and need help sequencing the rollout with your model retirement deadlines, reach out to your Microsoft account team or a Microsoft Foundry partner. GPT-5.x is the first major model bump in two years that requires code changes - but the changes collapse into one small compatibility module and a one-line LangChain subclass. With those in place your code is forwards-compatible (works on reasoning models today) and backwards- compatible (still works on every GPT-4 deployment you haven't migrated yet). The investment pays a recurring dividend: when the next reasoning bump ships, the only file that needs updating is model_compat.py. Appendix A - Minimal .env template # Endpoint and auth (unchanged between families) AZURE_OPENAI_ENDPOINT=https://<resource>.openai.azure.com AZURE_OPENAI_API_KEY=<key> # The deployment name decides the family. The classifier reads it. OPENAI_ENGINE=gpt-5.1 OPENAI_API_VERSION=2025-03-01-preview # Optional override for opaque deployment names # OPENAI_MODEL_FAMILY=reasoning # or "legacy" # Optional reasoning controls (ignored for legacy deployments) OPENAI_REASONING_EFFORT=medium OPENAI_VERBOSITY=medium OPENAI_REASONING_TOKEN_SCALE=2.5 OPENAI_REASONING_TOKEN_FLOOR=4096 # Flip when your LangChain templates support it # OPENAI_USE_DEVELOPER_ROLE=1 Appendix B - One-liner sanity checks # Does a deployment name classify correctly? python -c "from model_compat import get_model_family; print(get_model_family('gpt-5.1'))" # -> reasoning # Does the LangChain LLM strip ``stop`` when the deployment is GPT-5.1? python -c " from langchain_compat import ReasoningSafeAzureChatOpenAI import inspect; print(inspect.getsource(ReasoningSafeAzureChatOpenAI._generate)) " Companion repository: drop model_compat.py and langchain_compat.py next to each other in your utils/ package. They are zero-dependency on import, so you can vendor them into any service - web, function, batch job - without dragging Azure SDK or LangChain into module-load.489Views1like0CommentsBuild an AI Resume Screening Multi-Agent
Hello everyone 👋 I recently built an AI-powered resume screening multi-agent that automates the hiring intake and candidate evaluation process. The solution automatically: Accepts resume uploads Stores resumes in SharePoint Evaluates candidates using AI Matches resumes against job requirements Generates recruiter-ready summaries Sends automated recruiter notifications Technologies Used: • Power Automate • AI Builder • SharePoint • Outlook • GPT-Based Prompting This is Part 1 of my AI Hiring Multi-Agent Series. I’d love to hear feedback from the community and learn how others are approaching AI-powered recruitment workflows. Newsletter: https://www.linkedin.com/pulse/build-ai-resume-screening-multi-agent-sajeda-sultana-k1ege/?trackingId=TQzoIUy0RveiwXLIgEKvSQ%3D%3D If you find it helpful, please share it with others who may benefit from it as well.144Views3likes2CommentsLearn How to Build Smarter AI Agents with Microsoft’s MCP Resources Hub
If you've been curious about how to build your own AI agents that can talk to APIs, connect with tools like databases, or even follow documentation you're in the right place. Microsoft has created something called MCP, which stands for Model‑Context‑Protocol. And to help you learn it step by step, they’ve made an amazing MCP Resources Hub on GitHub. In this blog, I’ll Walk you through what MCP is, why it matters, and how to use this hub to get started, even if you're new to AI development. What is MCP (Model‑Context‑Protocol)? Think of MCP like a communication bridge between your AI model and the outside world. Normally, when we chat with AI (like ChatGPT), it only knows what’s in its training data. But with MCP, you can give your AI real-time context from: APIs Documents Databases Websites This makes your AI agent smarter and more useful just like a real developer who looks up things online, checks documentation, and queries databases. What’s Inside the MCP Resources Hub? The MCP Resources Hub is a collection of everything you need to learn MCP: Videos Blogs Code examples Here are some beginner-friendly videos that explain MCP: Title What You'll Learn VS Code Agent Mode Just Changed Everything See how VS Code and MCP build an app with AI connecting to a database and following docs. The Future of AI in VS Code Learn how MCP makes GitHub Copilot smarter with real-time tools. Build MCP Servers using Azure Functions Host your own MCP servers using Azure in C#, .NET, or TypeScript. Use APIs as Tools with MCP See how to use APIs as tools inside your AI agent. Blazor Chat App with MCP + Aspire Create a chat app powered by MCP in .NET Aspire Tip: Start with the VS Code videos if you’re just beginning. Blogs Deep Dives and How-To Guides Microsoft has also written blogs that explain MCP concepts in detail. Some of the best ones include: Build AI agent tools using remote MCP with Azure Functions: Learn how to deploy MCP servers remotely using Azure. Create an MCP Server with Azure AI Agent Service : Enables Developers to create an agent with Azure AI Agent Service and uses the model context protocol (MCP) for consumption of the agents in compatible clients (VS Code, Cursor, Claude Desktop). Vibe coding with GitHub Copilot: Agent mode and MCP support: MCP allows you to equip agent mode with the context and capabilities it needs to help you, like a USB port for intelligence. When you enter a chat prompt in agent mode within VS Code, the model can use different tools to handle tasks like understanding database schema or querying the web. Enhancing AI Integrations with MCP and Azure API Management Enhance AI integrations using MCP and Azure API Management Understanding and Mitigating Security Risks in MCP Implementations Overview of security risks and mitigation strategies for MCP implementations Protecting Against Indirect Injection Attacks in MCP Strategies to prevent indirect injection attacks in MCP implementations Microsoft Copilot Studio MCP Announcement of the Microsoft Copilot Studio MCP lab Getting started with MCP for Beginners 9 part course on MCP Client and Servers Code Repositories Try it Yourself Want to build something with MCP? Microsoft has shared open-source sample code in Python, .NET, and TypeScript: Repo Name Language Description Azure-Samples/remote-mcp-apim-functions-python Python Recommended for Secure remote hosting Sample Python Azure Functions demonstrating remote MCP integration with Azure API Management Azure-Samples/remote-mcp-functions-python Python Sample Python Azure Functions demonstrating remote MCP integration Azure-Samples/remote-mcp-functions-dotnet C# Sample .NET Azure Functions demonstrating remote MCP integration Azure-Samples/remote-mcp-functions-typescript TypeScript Sample TypeScript Azure Functions demonstrating remote MCP integration Microsoft Copilot Studio MCP TypeScript Microsoft Copilot Studio MCP lab You can clone the repo, open it in VS Code, and follow the instructions to run your own MCP server. Using MCP with the AI Toolkit in Visual Studio Code To make your MCP journey even easier, Microsoft provides the AI Toolkit for Visual Studio Code. This toolkit includes: A built-in model catalog Tools to help you deploy and run models locally Seamless integration with MCP agent tools You can install the AI Toolkit extension from the Visual Studio Code Marketplace. Once installed, it helps you: Discover and select models quickly Connect those models to MCP agents Develop and test AI workflows locally before deploying to the cloud You can explore the full documentation here: Overview of the AI Toolkit for Visual Studio Code – Microsoft Learn This is perfect for developers who want to test things on their own system without needing a cloud setup right away. Why Should You Care About MCP? Because MCP: Makes your AI tools more powerful by giving them real-time knowledge Works with GitHub Copilot, Azure, and VS Code tools you may already use Is open-source and beginner-friendly with lots of tutorials and sample code It’s the future of AI development connecting models to the real world. Final Thoughts If you're learning AI or building software agents, don’t miss this valuable MCP Resources Hub. It’s like a starter kit for building smart, connected agents with Microsoft tools. Try one video or repo today. Experiment. Learn by doing and start your journey with the MCP for Beginners curricula.3.6KViews2likes2Comments