generative ai
150 TopicsFoundry IQ: Improve recall by up to 54% with knowledge bases
Foundry IQ: Improve recall by up to 54% with knowledge bases. Foundry IQ (Azure AI Search) has improved its agentic retrieval engine resulting in better answer quality and improved token cost savings. We compared standalone retrieval tools to knowledge bases using the challenging BrowseComp-Plus benchmark and found: Replacing single-shot RAG with a knowledge base improves evidence recall by up to 46%. Combining a smaller agent model with agentic retrieval improves evidence recall by up to 54% while controlling costs and increasing agent responsiveness. In both cases, the amount of retrieval tool calls your agent makes is reduced, resulting in 34% token cost savings.528Views3likes0CommentsFoundry IQ: New governance and enterprise AI security capabilities
Enterprise AI isn’t just about better retrieval—it’s about secure access to business‑critical content. Discover how Foundry IQ (Azure AI Search) enables governance, compliance, and private connectivity across agentic retrieval workflows. We are introducing the following features: - Incremental SharePoint permissions sync for indexed document content, SharePoint Lists and ASPX pages. - Purview sensitivity labels in Foundry IQ knowledge bases - Purview auditing for elevated admin queries - Private connectivity support between for Foundry IQ and Foundry resources via NSP266Views1like0CommentsMigrating to GPT-5.x Without Breaking GPT-4: A Practical, Backward-Compatible Playbook
The first request your service sends after swapping gpt-4o for gpt-5.1 in production will return HTTP 400. Not in two weeks. On the first call. And the parameter the error points to isn't one you set anywhere in your code - it's bound onto the request by a LangChain helper you've used for two years. This post walks through every breaking change between the GPT-4 and GPT-5 families on Azure OpenAI in Microsoft Foundry, the integration cliffs nobody warns you about, and the small set of files you need so the same call sites work against both model families without branching. Who this is for: engineers maintaining an existing production codebase that calls Azure OpenAI / OpenAI - directly or through LangChain - and needs to onboard GPT-5.x while keeping the GPT-4 deployments alive during rollout. What you'll leave with: one copy-paste compatibility module, a tiny LangChain subclass, a prompt-audit harness, and a 10-step rollout checklist. 1. Why this migration is different Every previous Azure OpenAI bump - 3.5 → 4, 4 → 4o, 4o → 4o-mini - was additive. You changed engine="gpt-4o" and everything kept working. GPT-5.x is the first generation that is subtractive: parameters you used to send now return 400 Unsupported parameter. The wire protocol itself changed because GPT-5 is a reasoning model - it spends tokens thinking internally before it answers, so the parameters that controlled the old sampling pipeline (temperature, top_p, presence_penalty, frequency_penalty) no longer exist on the request schema. What this means for production code: A passing test suite against gpt-4o will fail on the first call against gpt-5.1 with HTTP 400. A passing test suite against gpt-5.1 will fail on every legacy gpt-4* deployment because the new reasoning controls (reasoning_effort, verbosity) are not recognised there. LangChain helpers that worked unmodified for two years (notably create_sql_query_chain) silently bind stop=[...] onto your LLM and trigger the same 400. Source-grep won't find the offending line because it lives inside the library. The good news: the divergence is mechanical. With one detection helper, one parameter-builder, and one tiny LangChain subclass you can run the same code against both families. 2. The breaking-changes matrix Concern GPT-4 / GPT-4o (legacy) GPT-5.x / o1 / o3 (reasoning) Output budget max_tokens max_completion_tokens (rejects max_tokens) temperature 0.0–1.0 Only the default (1) is accepted - omit it top_p Supported Rejected presence_penalty, frequency_penalty Supported Rejected logprobs, logit_bias Supported Rejected stop sequences Supported Rejected on most reasoning deployments reasoning_effort Rejected New: minimal | low | medium | high verbosity Rejected New: low | medium | high (sometimes via extra_body) System instruction role system developer recommended; system still works as alias Output token cost Output tokens only Output + reasoning tokens count against your cap Recommended API version 2024-12-01-preview or earlier 2025-03-01-preview or later Two consequences are easy to miss: max_completion_tokens is a shared budget. GPT-5.1 can burn 2–4× more tokens internally before emitting the first response token. A cap of 4096 that comfortably held a SQL query on GPT-4o now silently truncates the answer mid-token on GPT-5.1. Multiply your legacy budgets by ~2.5× and add a floor (e.g. 4096) before sending. The stop parameter is the silent killer. Any helper that calls llm.bind(stop=[...]) - and there are several in langchain - will turn a working code path into a 400 the moment you swap deployments. 3. Compatibility strategy: detect, don't fork The temptation is to fork: one branch for GPT-4, one for GPT-5. Don't. The right unit of abstraction is one function that classifies the deployment into a family, and one function that builds a kwargs dict the SDK will accept for that family. Every call site - SDK, LangChain, raw HTTP - drains into the same kwargs builder. When you eventually retire GPT-4 you delete the legacy branch in one file, not in fifty. 4. The industry-agnostic compatibility module Drop the following file into your project. It has no Azure / OpenAI / LangChain imports at module load time, so the same file works from a web service, a serverless function, a notebook, or a CLI tool. 4.1 model_compat.py """ Model compatibility helper for GPT-5.x with GPT-4 backward compatibility. This module centralises the parameter translation needed to talk to the "reasoning" generation of OpenAI / Azure OpenAI models (GPT-5, GPT-5.1, o1, o3, o4) while keeping older deployments (gpt-4, gpt-4o, gpt-4-32k, gpt-3.5-turbo, etc.) working unchanged. """ from __future__ import annotations import logging import os import re from typing import Any, Dict, Iterable, Mapping, Optional # --------------------------------------------------------------------------- # Family detection # --------------------------------------------------------------------------- _REASONING_PATTERNS = ( # gpt-5, gpt5, gpt-5.1, gpt_5, GPT 5, gpt5mini-prod-eu, ... re.compile(r"(?i)(^|[^a-z0-9])gpt[-_ ]?5(\.\d+)?([^0-9]|$)"), # o1, o3, o4, o1-mini, o3-preview ... re.compile(r"(?i)(^|[^a-z0-9])o[134](-mini|-preview)?([^a-z0-9]|$)"), ) _LEGACY_PATTERNS = ( re.compile(r"(?i)gpt[-_ ]?4o"), re.compile(r"(?i)gpt[-_ ]?4(?!\d)"), re.compile(r"(?i)gpt[-_ ]?4[-_ ]?32k"), re.compile(r"(?i)gpt[-_ ]?3\.?5"), re.compile(r"(?i)gpt[-_ ]?35"), ) def get_model_family(model_or_deployment: Optional[str]) -> str: """Return ``"reasoning"`` for GPT-5.x / o-series, ``"legacy"`` otherwise. Honours an ``OPENAI_MODEL_FAMILY`` env-var override for deployments whose user-defined name does not embed the model family (e.g. ``prod-default``). """ override = (os.getenv("OPENAI_MODEL_FAMILY") or "").strip().lower() if override in {"reasoning", "gpt-5", "gpt5", "gpt-5.1", "o-series", "o1", "o3"}: return "reasoning" if override in {"legacy", "gpt-4", "gpt4", "gpt-3.5", "gpt35", "chat"}: return "legacy" name = (model_or_deployment or "").strip() if not name: # Fail closed: when we don't know, assume legacy so old code keeps # working. Misclassifying a reasoning deployment as legacy fails fast # with a clear "Unsupported parameter" 400; the reverse silently # drops parameters the caller expected. return "legacy" for pat in _REASONING_PATTERNS: if pat.search(name): return "reasoning" for pat in _LEGACY_PATTERNS: if pat.search(name): return "legacy" return "legacy" def is_reasoning_model(model_or_deployment: Optional[str]) -> bool: return get_model_family(model_or_deployment) == "reasoning" # --------------------------------------------------------------------------- # Reasoning controls # --------------------------------------------------------------------------- _VALID_REASONING_EFFORT = {"minimal", "low", "medium", "high"} _VALID_VERBOSITY = {"low", "medium", "high"} def _coerce_choice(raw: Optional[str], valid: Iterable[str]) -> Optional[str]: if raw is None: return None value = str(raw).strip().lower() if not value: return None if value not in set(valid): logging.warning( "Ignoring unsupported value '%s'; expected one of %s", raw, sorted(valid), ) return None return value def get_reasoning_effort(override: Optional[str] = None) -> Optional[str]: return _coerce_choice( override if override is not None else os.getenv("OPENAI_REASONING_EFFORT"), _VALID_REASONING_EFFORT, ) def get_verbosity(override: Optional[str] = None) -> Optional[str]: return _coerce_choice( override if override is not None else os.getenv("OPENAI_VERBOSITY"), _VALID_VERBOSITY, ) # --------------------------------------------------------------------------- # max_completion_tokens scaling # --------------------------------------------------------------------------- def _reasoning_token_scale() -> float: """Multiplier applied to legacy ``max_tokens`` when targeting a reasoning model.""" try: scale = float(os.getenv("OPENAI_REASONING_TOKEN_SCALE", "2.5")) except (TypeError, ValueError): scale = 2.5 return scale if scale > 0 else 1.0 def _reasoning_token_floor() -> int: try: floor = int(os.getenv("OPENAI_REASONING_TOKEN_FLOOR", "4096")) except (TypeError, ValueError): floor = 4096 return floor if floor > 0 else 4096 def scale_max_tokens_for_reasoning(max_tokens: Optional[int]) -> Optional[int]: """Scale a legacy ``max_tokens`` budget up for reasoning models. ``None`` and ``-1`` ("no explicit cap") are passed through. """ if max_tokens is None: return None if max_tokens == -1: return -1 return max(int(round(max_tokens * _reasoning_token_scale())), _reasoning_token_floor()) # --------------------------------------------------------------------------- # Kwargs builders # --------------------------------------------------------------------------- _SAMPLING_KEYS = ("temperature", "top_p", "presence_penalty", "frequency_penalty") def _drop_none(mapping: Mapping[str, Any]) -> Dict[str, Any]: return {k: v for k, v in mapping.items() if v is not None} def build_openai_chat_kwargs( model: str, *, max_tokens: Optional[int] = None, temperature: Optional[float] = None, top_p: Optional[float] = None, presence_penalty: Optional[float] = None, frequency_penalty: Optional[float] = None, reasoning_effort: Optional[str] = None, verbosity: Optional[str] = None, extra: Optional[Mapping[str, Any]] = None, ) -> Dict[str, Any]: """Build kwargs for ``openai.OpenAI / AzureOpenAI .chat.completions.create``. Splat the result directly: ``client.chat.completions.create(**kwargs)``. Unsupported parameters are silently omitted for reasoning models; legacy deployments retain the historical behaviour. """ family = get_model_family(model) kwargs: Dict[str, Any] = {"model": model} # ---- output budget ---- if max_tokens is not None and max_tokens != -1: if family == "reasoning": kwargs["max_completion_tokens"] = scale_max_tokens_for_reasoning(int(max_tokens)) else: kwargs["max_tokens"] = int(max_tokens) # ---- sampling ---- if family == "legacy": kwargs.update(_drop_none({ "temperature": temperature, "top_p": top_p, "presence_penalty": presence_penalty, "frequency_penalty": frequency_penalty, })) else: for key, value in ( ("temperature", temperature), ("top_p", top_p), ("presence_penalty", presence_penalty), ("frequency_penalty", frequency_penalty), ): if value is not None: logging.debug( "Dropping unsupported parameter '%s' for reasoning model '%s'", key, model, ) # ---- reasoning controls ---- if family == "reasoning": effort = get_reasoning_effort(reasoning_effort) if effort is not None: kwargs["reasoning_effort"] = effort verb = get_verbosity(verbosity) if verb is not None: # ``verbosity`` is not a top-level kwarg in openai-python <= 1.65.x; # route it via ``extra_body`` so it lands in the JSON without a # TypeError from the SDK. kwargs.setdefault("extra_body", {})["verbosity"] = verb # ---- caller-supplied extras (already filtered) ---- if extra: for key, value in extra.items(): if value is None: continue if family == "reasoning" and key in _SAMPLING_KEYS: continue kwargs[key] = value return kwargs def build_langchain_chat_kwargs( deployment_name: str, *, max_tokens: Optional[int] = None, temperature: Optional[float] = None, top_p: Optional[float] = None, reasoning_effort: Optional[str] = None, verbosity: Optional[str] = None, ) -> Dict[str, Any]: """Build kwargs for ``langchain_openai.AzureChatOpenAI`` / ``ChatOpenAI``. Older ``langchain-openai`` releases don't expose ``max_completion_tokens`` as a top-level kwarg, so we forward it through ``model_kwargs`` (which langchain passes straight to the SDK). """ family = get_model_family(deployment_name) kwargs: Dict[str, Any] = {} model_kwargs: Dict[str, Any] = {} if max_tokens is not None and max_tokens != -1: if family == "reasoning": model_kwargs["max_completion_tokens"] = scale_max_tokens_for_reasoning(int(max_tokens)) else: kwargs["max_tokens"] = int(max_tokens) if family == "reasoning": effort = get_reasoning_effort(reasoning_effort) if effort is not None: model_kwargs["reasoning_effort"] = effort verb = get_verbosity(verbosity) if verb is not None: model_kwargs.setdefault("extra_body", {})["verbosity"] = verb else: if temperature is not None: kwargs["temperature"] = temperature if top_p is not None: kwargs["top_p"] = top_p if model_kwargs: kwargs["model_kwargs"] = model_kwargs return kwargs def get_system_role(model_or_deployment: Optional[str] = None) -> str: """Return ``"developer"`` for reasoning models when opted in, ``"system"`` otherwise. Defaulting to ``"system"`` preserves compatibility with LangChain prompt templates and SDK helpers that don't yet recognise the new role. Opt in with ``OPENAI_USE_DEVELOPER_ROLE=1`` once your stack supports it. """ if not is_reasoning_model(model_or_deployment): return "system" raw = os.getenv("OPENAI_USE_DEVELOPER_ROLE", "") return "developer" if raw.strip().lower() in {"1", "true", "yes", "on"} else "system" 4.2 What this buys you Every direct-SDK call collapses to two lines: from openai import AzureOpenAI from model_compat import build_openai_chat_kwargs client = AzureOpenAI( azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"], api_version=os.environ["OPENAI_API_VERSION"], api_key=os.environ["AZURE_OPENAI_API_KEY"], ) kwargs = build_openai_chat_kwargs( model=os.environ["OPENAI_ENGINE"], max_tokens=4096, # automatically becomes max_completion_tokens for GPT-5 temperature=0.2, # automatically dropped for GPT-5 reasoning_effort="low", # automatically dropped for GPT-4 ) response = client.chat.completions.create( messages=[ {"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": user_input}, ], **kwargs, ) The same call site now correctly targets gpt-5.1, gpt-4o, gpt-4-32k, o3-mini, or any future deployment whose name embeds the family - and you can override with the OPENAI_MODEL_FAMILY env var when the deployment alias is opaque. 4.3 Raw HTTP call sites Some legacy code paths bypass the SDK and POST JSON directly. The same builder works there: import json import requests from model_compat import build_openai_chat_kwargs, get_system_role deployment = os.environ["OPENAI_ENGINE"] api_version = os.environ["OPENAI_API_VERSION"] endpoint = ( f"{os.environ['AZURE_OPENAI_ENDPOINT']}/openai/deployments/{deployment}" f"/chat/completions?api-version={api_version}" ) payload = { "messages": [ {"role": get_system_role(deployment), "content": system_prompt}, {"role": "user", "content": user_prompt}, ], } # Splat the kwargs into the payload, then strip the SDK-only ``model`` key. payload.update(build_openai_chat_kwargs( model=deployment, max_tokens=800, temperature=0.7, top_p=0.95, reasoning_effort="low", )) payload.pop("model", None) # ``model`` is encoded in the URL for Azure payload.pop("extra_body", None) # already on the payload root resp = requests.post( endpoint, headers={"Content-Type": "application/json", "api-key": api_key}, data=json.dumps(payload), timeout=60, ) resp.raise_for_status() 5. LangChain: the hidden stop parameter langchain.chains.sql_database.query.create_sql_query_chain calls llm.bind(stop=["\nSQLResult:"]) internally to terminate the model's output before the example block in its prompt. That stop value is forwarded to the SDK on every invocation. GPT-5.1 rejects it: openai.BadRequestError: Error code: 400 - {'error': { 'message': "Unsupported parameter: 'stop' is not supported with this model.", 'type': 'invalid_request_error', 'param': 'stop', }} You can't reach into the chain to disable it. The clean fix is a thin AzureChatOpenAI subclass that drops stop for reasoning models only: 5.1 langchain_compat.py """LangChain-side compatibility shim for reasoning-class deployments.""" from __future__ import annotations from typing import Any, List, Optional from langchain_core.callbacks.manager import ( AsyncCallbackManagerForLLMRun, CallbackManagerForLLMRun, ) from langchain_core.messages import BaseMessage from langchain_core.outputs import ChatResult from langchain_openai import AzureChatOpenAI # use ChatOpenAI for non-Azure from model_compat import is_reasoning_model class ReasoningSafeAzureChatOpenAI(AzureChatOpenAI): """``AzureChatOpenAI`` variant that hides parameters reasoning models reject. Reasoning models (GPT-5.x, o1/o3/o4) return HTTP 400 when a request payload carries ``stop``. LangChain's SQL helpers unconditionally bind it, so the unsupported parameter reaches the SDK regardless of how the caller configured the LLM. This subclass strips ``stop`` for reasoning deployments while forwarding it unchanged for legacy GPT-4 / GPT-3.5 deployments - the behaviour is byte-identical to upstream LangChain for those models. """ def _deployment_id(self) -> str: # ``langchain-openai`` >= 0.2 exposes ``azure_deployment``; older # releases use ``deployment_name``. Either may be set by the caller. return ( getattr(self, "azure_deployment", None) or getattr(self, "deployment_name", None) or "" ) def _generate( self, messages: List[BaseMessage], stop: Optional[List[str]] = None, run_manager: Optional[CallbackManagerForLLMRun] = None, **kwargs: Any, ) -> ChatResult: if is_reasoning_model(self._deployment_id()): stop = None return super()._generate(messages, stop=stop, run_manager=run_manager, **kwargs) async def _agenerate( self, messages: List[BaseMessage], stop: Optional[List[str]] = None, run_manager: Optional[AsyncCallbackManagerForLLMRun] = None, **kwargs: Any, ) -> ChatResult: if is_reasoning_model(self._deployment_id()): stop = None return await super()._agenerate(messages, stop=stop, run_manager=run_manager, **kwargs) Use it as a drop-in replacement: from langchain_compat import ReasoningSafeAzureChatOpenAI from model_compat import build_langchain_chat_kwargs llm_kwargs = build_langchain_chat_kwargs( deployment_name=os.environ["OPENAI_ENGINE"], max_tokens=6000, temperature=0, reasoning_effort="low", ) llm = ReasoningSafeAzureChatOpenAI( azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"], azure_deployment=os.environ["OPENAI_ENGINE"], openai_api_version=os.environ["OPENAI_API_VERSION"], api_key=os.environ["AZURE_OPENAI_API_KEY"], **llm_kwargs, ) That single substitution makes create_sql_query_chain, SQLDatabaseChain, and the ChatOpenAI-based RAG helpers all work against GPT-5.1 without any other changes. 6. The second LangChain gotcha: prose where SQL should be create_sql_query_chain is documented to return the literal string "I don't know" (or a similar fallback) when the LLM cannot form a query. The default code path takes the chain output and runs it against the database: sql = chain.invoke({...}) # -> "I don't know" result = db.run(sql) # -> sends "I don't know" to pyodbc The database faithfully returns: [42000] Unclosed quotation mark after the character string 't know'. (105) Which surfaces to the end user as a misleading "SQL syntax error". The mitigation is a one-line guard that validates the chain output looks like SQL before execution: import re _SQL_START_RE = re.compile( r"^\s*(?:WITH|SELECT|INSERT|UPDATE|DELETE|CREATE|DROP|ALTER|MERGE|EXEC|EXECUTE|TRUNCATE)\b", re.IGNORECASE, ) def looks_like_sql(text: str) -> bool: """True only if ``text`` starts with a recognised SQL DML/DDL keyword.""" if not text or not text.strip(): return False return bool(_SQL_START_RE.match(text)) sql = extract_sql_query(chain.invoke({...})) if not looks_like_sql(sql): logging.warning("SQL chain returned a non-SQL response: %r", sql[:200]) return ( "I couldn't form a SQL query for that question. " "Please rephrase or add more context." ) result = db.run(sql) This isn't specific to GPT-5.1 - it's good hygiene for any LLM that backs a SQL agent - but the failure mode becomes much more frequent on reasoning models because they're better at refusing. 7. Cleaning Markdown out of create_sql_query_chain output Reasoning models like to wrap their answer in a markdown fence and append a "Note:" or "Explanation:" paragraph. None of that survives db.run(). A defensive extract_sql_query handles all the variants: import re def extract_sql_query(text: str) -> str: """Strip markdown fences, leading prose, and trailing explanations.""" # 1) Prefer SQL inside a markdown code fence. m = re.search(r"```(?:sql|SQL|Sql)?\s*\n(.*?)\n```", text, re.DOTALL) if m: text = m.group(1) text = text.strip() # 2) Drop any prose *before* the SQL by jumping to the first SQL keyword. m = re.search( r"(?im)^\s*(WITH|SELECT|INSERT|UPDATE|DELETE|CREATE|DROP|ALTER|MERGE|EXEC|EXECUTE|TRUNCATE)\b", text, ) if m: text = text[m.start(1):] # 3) Cut at the first "Explanation:" / "Note:" / "This query..." marker. m = re.compile( r"(?im)^\s*(?:Explanation|Note|Notes|Here(?:'|\u2019)?s|" r"This\s+(?:query|SQL|statement|returns|counts|selects|will|gets|finds)|" r"The\s+(?:query|SQL|above|result|statement)|" r"Result|Results|Description|Output|Answer)\b[^\n]*" ).search(text) if m: text = text[: m.start()].rstrip() # 4) Drop any trailing fence that survived step 1. if text.endswith("```"): text = text[:-3].rstrip() return text.strip() 8. Package versioning The bare minimum your requirements.txt / environment.yml needs: Package Last GPT-4-only version First GPT-5.x-safe version Notes openai 1.55.x 1.65.x (recommend 1.65.4+) Earlier versions reject max_completion_tokens and reasoning_effort as unknown kwargs langchain-openai 0.2.14 0.3.7+ 0.3.x line exposes azure_deployment and forwards model_kwargs correctly to the new SDK langchain 0.3.14 0.3.21+ Pin together with langchain-openai and langchain-core langchain-core 0.3.29 0.3.49+ Update in lockstep with the others langchain-community 0.3.14 0.3.20+ Mostly transitive; needed for SQLDatabase helpers tiktoken 0.7.x 0.8.0+ Encodings for GPT-5.1 ship in 0.8.0; older versions fall back to cl100k_base for unknown models tokencost (optional) 0.1.16 0.1.20+ Update for GPT-5.x price tables Azure OpenAI API version 2024-12-01-preview 2025-03-01-preview First version that ships reasoning_effort and the GPT-5.x routing Pin exact versions after testing - LangChain has a habit of moving public re-exports between minor releases. requirements.txt snippet: openai==1.65.4 langchain==0.3.21 langchain-core==0.3.49 langchain-openai==0.3.7 langchain-community==0.3.20 tiktoken==0.8.0 9. New GPT-5.x knobs worth using Once you're on a reasoning deployment, two new parameters become available. Both are optional, both default to a sensible value, and both are stripped by the kwargs builder above when the target is a legacy model. reasoning_effort minimal - one-shot lookups, classification. low - deterministic structured output (SQL, JSON-schema extraction, rule-based rewrites). Lowest cost overhead. medium (default) - RAG, summarisation, normal Q&A. high - multi-step analytical reasoning, complex code synthesis. A useful pattern is to choose the level by task profile rather than at the call site: TASK_EFFORT = { "sql": "low", "structured_extract": "low", "kg_cleaning": "low", "rag_qa": "medium", "vision": "medium", "analytical": "high", } verbosity low | medium | high. Controls the length of the response, not its substance. Useful for grounding chat UIs where you want crisp answers - set low for /answer endpoints and high for "explain like a senior engineer" panels. Note: in openai-python <= 1.65.x, verbosity is not yet a top-level keyword argument; pass it through extra_body (the builder above already does this). developer role GPT-5.x prefers {"role": "developer", "content": "..."} for instructions that previously used system. The change is non-breaking on the Azure side - system is still accepted as an alias - but some downstream LangChain prompt templates predate the role and will reject it on construction. Treat developer as opt-in (OPENAI_USE_DEVELOPER_ROLE=1) for now; flip the default after your prompt-template version is known good. 10. Auditing your existing prompts When the wire-level migration is done your service will talk to GPT-5.x - but that doesn't mean it says the right thing. Reasoning models read prompts differently in ways that won't show up as 400s: They take instructions more literally. A prompt that worked when GPT-4o rounded the corners may surface every edge case verbatim. They refuse more often. "I don't know" / "I cannot help with that" are more frequent because reasoning models are less willing to confabulate. They ignore "be concise" / "be terse". Use the new verbosity knob. Step-by-step / chain-of-thought instructions become redundant. The model already reasons internally; extra "think before you answer" prose competes with its own chain of thought and often hurts output quality. Negative-only instructions can backfire. "Never output X" prompts occasionally cause refusals where you'd rather have a workaround. 10.1 Build a prompt regression harness Capture every system+user prompt your service emits in a CSV, then replay each one against both deployments and diff the output. The diff is the single most useful artefact you can produce before the cutover: # prompt_audit.py - minimal differential tester import csv from openai import AzureOpenAI from model_compat import build_openai_chat_kwargs LEGACY = "gpt-4o" REASONING = "gpt-5.1" client = AzureOpenAI( azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"], api_version=os.environ["OPENAI_API_VERSION"], api_key=os.environ["AZURE_OPENAI_API_KEY"], ) def run(model: str, system: str, user: str) -> str: kw = build_openai_chat_kwargs( model=model, max_tokens=4096, temperature=0.2, # auto-dropped for reasoning reasoning_effort="medium", # auto-dropped for legacy ) resp = client.chat.completions.create( messages=[ {"role": "system", "content": system}, {"role": "user", "content": user}, ], **kw, ) return resp.choices[0].message.content or "" with open("prompts.csv") as f_in, open("diff.tsv", "w", newline="") as f_out: writer = csv.writer(f_out, delimiter="\t") writer.writerow(["id", "legacy_first80", "reasoning_first80", "len_legacy", "len_new", "identical"]) for row in csv.DictReader(f_in): legacy = run(LEGACY, row["system"], row["user"]) new = run(REASONING, row["system"], row["user"]) writer.writerow([ row["id"], legacy[:80].replace("\n", " "), new[:80].replace("\n", " "), len(legacy), len(new), legacy.strip() == new.strip(), ]) Capture three signals per prompt - they're enough to triage 95% of drift: Format compliance. Did the output still parse as the expected JSON / YAML / Markdown / SQL? Run your existing downstream parser on both columns. Token cost delta. Reasoning models tend to be more verbose by default. Anything beyond +20% is a candidate for the verbosity="low" knob. Semantic drift. Spot-check 5–10% of rows by hand. You're looking for changes in intent, not changes in wording. 10.2 Common rewrites to make prompts model-agnostic The goal isn't to write two prompts. It's to write one prompt that produces correct output on both families by moving constraints out of the natural-language body and into the request shape. 10.2a. Format constraints belong in response_format, not the prose Don't: Output ONLY a JSON object with keys `name` and `score`. Do not include any explanation. Do not wrap in markdown. Do not say anything else. Do: resp = client.chat.completions.create( messages=[...], response_format={ "type": "json_schema", "json_schema": { "name": "scored_entity", "schema": { "type": "object", "properties": { "name": {"type": "string"}, "score": {"type": "number"}, }, "required": ["name", "score"], "additionalProperties": False, }, "strict": True, }, }, **kw, ) response_format is honoured by both gpt-4o (>= 2024-08-06) and the entire GPT-5.x line. The prompt loses three lines of brittle natural-language constraints and you get schema-validated output for free. 10.2b. Replace "think step by step" with reasoning_effort Don't: Let's think step by step. First identify the entity. Then find the category. Then compute the score. Then format the answer. Do: delete the prose and pass reasoning_effort="medium" (or "high") for reasoning deployments. The kwargs builder drops the parameter automatically for GPT-4 models, so the same prompt now produces: step-by-step reasoning internally on GPT-5.x (lower output token cost), the same final answer on GPT-4o that the verbose prompt used to elicit. 10.2c. Replace temperature-based variety with n sampling If your code relied on temperature=0.9 to get diverse completions, GPT-5.x will return roughly the same answer every time. Generate variety the explicit way: resp = client.chat.completions.create(messages=[...], n=5, **kw) candidates = [c.message.content for c in resp.choices] Or call the model N times with slightly different framings. Both patterns work against either family with no further code changes. 10.2d. Move procedural instructions to the developer role For multi-step workflows, the new developer role gives clearer separation between what the system enforces and what the user is asking: messages = [ {"role": get_system_role(deployment), "content": role_card_for_assistant}, {"role": "developer", "content": procedural_instructions}, {"role": "user", "content": user_question}, ] get_system_role returns "system" for legacy models and "developer" for reasoning models opted in via OPENAI_USE_DEVELOPER_ROLE=1. Once your LangChain templates support the new role you can flip the default. 10.2e. Add a literal-execution header for strict formats For prompts where the exact output shape matters (table generation, SQL with a fixed column order, structured incident reports), prepend an explicit literal-execution header so reasoning models don't drift into "helpful improvements": LITERAL_EXECUTION_HEADER = ( "Execution mode: follow the instructions below literally and in order. " "Do not infer intent, skip, reorder, merge, or add steps. Honour the " "exact formatting, tone, and verbosity specified. If a step is " "ambiguous, respond with the literal interpretation and flag the " "ambiguity instead of guessing." ) def apply_literal_execution(prompt: str) -> str: if LITERAL_EXECUTION_HEADER in prompt: return prompt return f"{LITERAL_EXECUTION_HEADER}\n\n{prompt}" It's a no-op on GPT-4o (the older models already follow instructions literally enough) and a meaningful guard rail on GPT-5.1. Wire it behind an OPENAI_LITERAL_EXECUTION flag so you can disable it without redeploying. 10.3 A prompt-shaped checklist Run every prompt your service emits past these questions: Question Action Does it specify output format in prose? Move to response_format (10.2a) Does it include "think step by step"? Remove; set reasoning_effort (10.2b) Does it set tone constraints ("be concise")? Use verbosity Does it use negative-only instructions ("never X")? Add positive alternative ("do Y instead") Does it embed example outputs with values that would change? Replace concrete values with placeholder tokens (<VALUE>) Does it rely on temperature > 0 for variety? Use n=K sampling (10.2c) Is the system prompt > 2k tokens? Split into role-card (system) + procedure (developer) Does output ordering matter? Add the literal-execution header (10.2e) 10.4 Score before you ship Don't approve a rewritten prompt by eyeballing one example. Score it: Format compliance rate. Percentage of N=50 outputs that pass your existing downstream parser / JSON schema validation. Token cost delta. Cap regression at +20% versus the legacy baseline. Beyond that, dial verbosity="low" or tighten the prompt. Latency p50 / p95 delta. Reasoning models add tail latency. If your SLA is tight, set reasoning_effort="low" for the path or move it to a background queue. A prompt that regresses on any of those by more than your tolerance window ships behind a feature flag with rollback wired in. 11. Testing strategy Two test layers catch >90% of regressions: Family-classification tests import pytest from model_compat import get_model_family, build_openai_chat_kwargs @pytest.mark.parametrize("name,expected", [ ("gpt-5.1", "reasoning"), ("gpt5", "reasoning"), ("gpt-5-prod-eu", "reasoning"), ("o3-mini", "reasoning"), ("o1", "reasoning"), ("gpt-4o", "legacy"), ("gpt-4", "legacy"), ("gpt-4-32k", "legacy"), ("gpt-35-turbo", "legacy"), ("", "legacy"), # unknown -> fail closed to legacy (None, "legacy"), ]) def test_family(name, expected): assert get_model_family(name) == expected def test_kwargs_for_reasoning_drops_temperature(): kw = build_openai_chat_kwargs( model="gpt-5.1", max_tokens=1000, temperature=0.2, top_p=0.9, reasoning_effort="low", ) assert "temperature" not in kw assert "top_p" not in kw assert kw["max_completion_tokens"] >= 4096 # floor applied assert kw["reasoning_effort"] == "low" def test_kwargs_for_legacy_keeps_temperature(): kw = build_openai_chat_kwargs( model="gpt-4o", max_tokens=1000, temperature=0.2, top_p=0.9, ) assert kw["max_tokens"] == 1000 assert kw["temperature"] == 0.2 assert kw["top_p"] == 0.9 assert "reasoning_effort" not in kw Wire-level smoke tests For each LLM call site you maintain, write a single integration test that exercises the chain against a real (or mocked) endpoint and asserts: HTTP 200, non-empty content, finish_reason != "length" (so you catch silent truncation), (optional) classifier-style assertions against a golden output. Run those tests once against the legacy deployment and once against the new one - same test code, two OPENAI_ENGINE values. 12. Things that don't change It's easy to over-correct. Several pieces of plumbing keep working without modification: Authentication. AAD token providers, managed identity, and API keys are unchanged. Embeddings. text-embedding-3-small, text-embedding-3-large, and text-embedding-ada-002 are not part of the reasoning generation; the embeddings call shape is identical. Function calling / tool use. Same JSON schema, same response shape. Streaming. SSE format is unchanged. Token counters. tiktoken still works, but bump to 0.8.0+ so the new model name resolves to the right encoding instead of silently falling back to cl100k_base. 13. Next steps If you only do four things from this post, do these - in order: Deploy a GPT-5.1 model side-by-side with your current GPT-4 deployment in Microsoft Foundry. Keep the GPT-4 deployment live; you'll need both for the parallel-run period. Drop model_compat.py and langchain_compat.py into your project (Sections 4 and 5). Replace every AzureChatOpenAI(...) construction with ReasoningSafeAzureChatOpenAI and route every kwargs literal through the builders. Run the prompt-audit harness (Section 10.1) against your top 50 most frequently invoked prompts. Triage the diff with the checklist in 10.3. Roll out behind a percentage-based flag. Start at 5% of traffic for 24 hours, compare quality and cost telemetry against the GPT-4o baseline, then ramp. Reference material Azure OpenAI in Microsoft Foundry - model overview Azure OpenAI model retirements and deprecations Reasoning models in Azure OpenAI Structured Outputs in Azure OpenAI openai-python SDK changelog langchain-openai release notes Talk to us Open an issue on the Microsoft Foundry GitHub samples repository if you hit a gap this post didn't cover. Share your migration story or numbers in the comments below - field data is the fastest way to make this guide better for the next team. If you operate a regulated workload (finance, health, public sector) and need help sequencing the rollout with your model retirement deadlines, reach out to your Microsoft account team or a Microsoft Foundry partner. GPT-5.x is the first major model bump in two years that requires code changes - but the changes collapse into one small compatibility module and a one-line LangChain subclass. With those in place your code is forwards-compatible (works on reasoning models today) and backwards- compatible (still works on every GPT-4 deployment you haven't migrated yet). The investment pays a recurring dividend: when the next reasoning bump ships, the only file that needs updating is model_compat.py. Appendix A - Minimal .env template # Endpoint and auth (unchanged between families) AZURE_OPENAI_ENDPOINT=https://<resource>.openai.azure.com AZURE_OPENAI_API_KEY=<key> # The deployment name decides the family. The classifier reads it. OPENAI_ENGINE=gpt-5.1 OPENAI_API_VERSION=2025-03-01-preview # Optional override for opaque deployment names # OPENAI_MODEL_FAMILY=reasoning # or "legacy" # Optional reasoning controls (ignored for legacy deployments) OPENAI_REASONING_EFFORT=medium OPENAI_VERBOSITY=medium OPENAI_REASONING_TOKEN_SCALE=2.5 OPENAI_REASONING_TOKEN_FLOOR=4096 # Flip when your LangChain templates support it # OPENAI_USE_DEVELOPER_ROLE=1 Appendix B - One-liner sanity checks # Does a deployment name classify correctly? python -c "from model_compat import get_model_family; print(get_model_family('gpt-5.1'))" # -> reasoning # Does the LangChain LLM strip ``stop`` when the deployment is GPT-5.1? python -c " from langchain_compat import ReasoningSafeAzureChatOpenAI import inspect; print(inspect.getsource(ReasoningSafeAzureChatOpenAI._generate)) " Companion repository: drop model_compat.py and langchain_compat.py next to each other in your utils/ package. They are zero-dependency on import, so you can vendor them into any service - web, function, batch job - without dragging Azure SDK or LangChain into module-load.352Views0likes0CommentsEmbracing Responsible AI: A Comprehensive Guide and Call to Action
In an age where artificial intelligence (AI) is becoming increasingly integrated into our daily lives, the need for responsible AI practices has never been more critical. From healthcare to finance, AI systems influence decisions affecting millions of people. As developers, organizations, and users, we are responsible for ensuring that these technologies are designed, deployed, and evaluated ethically. This blog will delve into the principles of responsible AI, the importance of assessing generative AI applications, and provide a call to action to engage with the Microsoft Learn Module on responsible AI evaluations. What is Responsible AI? Responsible AI encompasses a set of principles and practices aimed at ensuring that AI technologies are developed and used in ways that are ethical, fair, and accountable. Here are the core principles that define responsible AI: Fairness AI systems must be designed to avoid bias and discrimination. This means ensuring that the data used to train these systems is representative and that the algorithms do not favor one group over another. Fairness is crucial in applications like hiring, lending, and law enforcement, where biased AI can lead to significant societal harm. Transparency Transparency involves making AI systems understandable to users and stakeholders. This includes providing clear explanations of how AI models make decisions and what data they use. Transparency builds trust and allows users to challenge or question AI decisions when necessary. Accountability Developers and organizations must be held accountable for the outcomes of their AI systems. This includes establishing clear lines of responsibility for AI decisions and ensuring that there are mechanisms in place to address any negative consequences that arise from AI use. Privacy AI systems often rely on vast amounts of data, raising concerns about user privacy. Responsible AI practices involve implementing robust data protection measures, ensuring compliance with regulations like GDPR, and being transparent about how user data is collected, stored, and used. The Importance of Evaluating Generative AI Applications Generative AI, which includes technologies that can create text, images, music, and more, presents unique challenges and opportunities. Evaluating these applications is essential for several reasons: Quality Assessment Evaluating the output quality of generative AI applications is crucial to ensure that they meet user expectations and ethical standards. Poor-quality outputs can lead to misinformation, misrepresentation, and a loss of trust in AI technologies. Custom Evaluators Learning to create and use custom evaluators allows developers to tailor assessments to specific applications and contexts. This flexibility is vital in ensuring that the evaluation process aligns with the intended use of the AI system. Synthetic Datasets Generative AI can be used to create synthetic datasets, which can help in training AI models while addressing privacy concerns and data scarcity. Evaluating these synthetic datasets is essential to ensure they are representative and do not introduce bias. Call to Action: Engage with the Microsoft Learn Module To deepen your understanding of responsible AI and enhance your skills in evaluating generative AI applications, I encourage you to explore the Microsoft Learn Module available at this link. What You Will Learn: Concepts and Methodologies: The module covers essential frameworks for evaluating generative AI, including best practices and methodologies that can be applied across various domains. Hands-On Exercises: Engage in practical, code-first exercises that simulate real-world scenarios. These exercises will help you apply the concepts learned tangibly, reinforcing your understanding. Prerequisites: An Azure subscription (you can create one for free). Basic familiarity with Azure and Python programming. Tools like Docker and Visual Studio Code for local development. Why This Matters By participating in this module, you are not just enhancing your skills; you are contributing to a broader movement towards responsible AI. As AI technologies continue to evolve, the demand for professionals who understand and prioritize ethical considerations will only grow. Your engagement in this learning journey can help shape the future of AI, ensuring it serves humanity positively and equitably. Conclusion As we navigate the complexities of AI technology, we must prioritize responsible AI practices. By engaging with educational resources like the Microsoft Learn Module on responsible AI evaluations, we can equip ourselves with the knowledge and skills necessary to create AI systems that are not only innovative but also ethical and responsible. Join the movement towards responsible AI today! Take the first step by exploring the Microsoft Learn Module and become an advocate for ethical AI practices in your community and beyond. Together, we can ensure that AI serves as a force for good in our society. References Evaluate generative AI applications https://learn.microsoft.com/en-us/training/paths/evaluate-generative-ai-apps/?wt.mc_id=studentamb_263805 Azure Subscription for Students https://azure.microsoft.com/en-us/free/students/?wt.mc_id=studentamb_263805 Visual Studio Code https://code.visualstudio.com/?wt.mc_id=studentamb_263805920Views0likes0CommentsFine-Tuning and Deploying Phi-3.5 Model with Azure and AI Toolkit
What is Phi-3.5? Phi-3.5 as a state-of-the-art language model with strong multilingual capabilities. Emphasize that it is designed to handle multiple languages with high proficiency, making it a versatile tool for Natural Language Processing (NLP) tasks across different linguistic backgrounds. Key Features of Phi-3.5 Highlight the core features of the Phi-3.5 model: Multilingual Capabilities: Explain that the model supports a wide variety of languages, including major world languages such as English, Spanish, Chinese, French, and others. You can provide an example of its ability to handle a sentence or document translation from one language to another without losing context or meaning. Fine-Tuning Ability: Discuss how the model can be fine-tuned for specific use cases. For instance, in a customer support setting, the Phi-3.5 model can be fine-tuned to understand the nuances of different languages used by customers across the globe, improving response accuracy. High Performance in NLP Tasks: Phi-3.5 is optimized for tasks like text classification, machine translation, summarization, and more. It has superior performance in handling large-scale datasets and producing coherent, contextually correct language outputs. Applications in Real-World Scenarios To make this section more engaging, provide a few real-world applications where the Phi-3.5 model can be utilized: Customer Support Chatbots: For companies with global customer bases, the model’s multilingual support can enhance chatbot capabilities, allowing for real-time responses in a customer’s native language, no matter where they are located. Content Creation for Global Markets: Discuss how businesses can use Phi-3.5 to automatically generate or translate content for different regions. For example, marketing copy can be adapted to fit cultural and linguistic nuances in multiple languages. Document Summarization Across Languages: Highlight how the model can be used to summarize long documents or articles written in one language and then translate the summary into another language, improving access to information for non-native speakers. Why Choose Phi-3.5 for Your Project? End this section by emphasizing why someone should use Phi-3.5: Versatility: It’s not limited to just one or two languages but performs well across many. Customization: The ability to fine-tune it for particular use cases or industries makes it highly adaptable. Ease of Deployment: With tools like Azure ML and Ollama, deploying Phi-3.5 in the cloud or locally is accessible even for smaller teams. Objective Of Blog Specialized Language Models (SLMs) are at the forefront of advancements in Natural Language Processing, offering fine-tuned, high-performance solutions for specific tasks and languages. Among these, the Phi-3.5 model has emerged as a powerful tool, excelling in its multilingual capabilities. Whether you're working with English, Spanish, Mandarin, or any other major world language, Phi-3.5 offers robust, reliable language processing that adapts to various real-world applications. This makes it an ideal choice for businesses looking to deploy multilingual chatbots, automate content generation, or translate customer interactions in real time. Moreover, its fine-tuning ability allows for customization, making Phi-3.5 versatile across industries and tasks. Customization and Fine-Tuning for Different Applications The Phi-3.5 model is not just limited to general language understanding tasks. It can be fine-tuned for specific applications, industries, and language models, allowing users to tailor its performance to meet their needs. Customizable for Industry-Specific Use Cases: With fine-tuning, the model can be trained further on domain-specific data to handle particular use cases like legal document translation, medical records analysis, or technical support. Example: A healthcare company can fine-tune Phi-3.5 to understand medical terminology in multiple languages, enabling it to assist in processing patient records or generating multilingual health reports. Adapting for Specialized Tasks: You can train Phi-3.5 to perform specialized tasks like sentiment analysis, text summarization, or named entity recognition in specific languages. Fine-tuning helps enhance the model's ability to handle unique text formats or requirements. Example: A marketing team can fine-tune the model to analyse customer feedback in different languages to identify trends or sentiment across various regions. The model can quickly classify feedback as positive, negative, or neutral, even in less widely spoken languages like Arabic or Korean. Applications in Real-World Scenarios To illustrate the versatility of Phi-3.5, here are some real-world applications where this model excels, demonstrating its multilingual capabilities and customization potential: Case Study 1: Multilingual Customer Support Chatbots Many global companies rely on chatbots to handle customer queries in real-time. With Phi-3.5’s multilingual abilities, businesses can deploy a single model that understands and responds in multiple languages, cutting down on the need to create language-specific chatbots. Example: A global airline can use Phi-3.5 to power its customer service bot. Passengers from different countries can inquire about their flight status or baggage policies in their native languages—whether it's Japanese, Hindi, or Portuguese—and the model responds accurately in the appropriate language. Case Study 2: Multilingual Content Generation Phi-3.5 is also useful for businesses that need to generate content in different languages. For example, marketing campaigns often require creating region-specific ads or blog posts in multiple languages. Phi-3.5 can help automate this process by generating localized content that is not just translated but adapted to fit the cultural context of the target audience. Example: An international cosmetics brand can use Phi-3.5 to automatically generate product descriptions for different regions. Instead of merely translating a product description from English to Spanish, the model can tailor the description to fit cultural expectations, using language that resonates with Spanish-speaking audiences. Case Study 3: Document Translation and Summarization Phi-3.5 can be used to translate or summarize complex documents across languages. Its ability to preserve meaning and context across languages makes it ideal for industries where accuracy is crucial, such as legal or academic fields. Example: A legal firm working on cross-border cases can use Phi-3.5 to translate contracts or legal briefs from German to English, ensuring the context and legal terminology are accurately preserved. It can also summarize lengthy documents in multiple languages, saving time for legal teams. Fine-Tuning Phi-3.5 Model Fine-tuning a language model like Phi-3.5 is a crucial step in adapting it to perform specific tasks or cater to specific domains. This section will walk through what fine-tuning is, its importance in NLP, and how to fine-tune the Phi-3.5 model using Azure Model Catalog for different languages and tasks. We'll also explore a code example and best practices for evaluating and validating the fine-tuned model. What is Fine-Tuning? Fine-tuning refers to the process of taking a pre-trained model and adapting it to a specific task or dataset by training it further on domain-specific data. In the context of NLP, fine-tuning is often required to ensure that the language model understands the nuances of a particular language, industry-specific terminology, or a specific use case. Why Fine-Tuning is Necessary Pre-trained Large Language Models (LLMs) are trained on diverse datasets and can handle various tasks like text summarization, generation, and question answering. However, they may not perform optimally in specialized domains without fine-tuning. The goal of fine-tuning is to enhance the model's performance on specific tasks by leveraging its prior knowledge while adapting it to new contexts. Challenges of Fine-Tuning Resource Intensiveness: Fine-tuning large models can be computationally expensive, requiring significant hardware resources. Storage Costs: Each fine-tuned model can be large, leading to increased storage needs when deploying multiple models for different tasks. LoRA and QLoRA To address these challenges, techniques like LoRA (Low-rank Adaptation) and QLoRA (Quantized Low-rank Adaptation) have emerged. Both methods aim to make the fine-tuning process more efficient: LoRA: This technique reduces the number of trainable parameters by introducing low-rank matrices into the model while keeping the original model weights frozen. This approach minimizes memory usage and speeds up the fine-tuning process. QLoRA: An enhancement of LoRA, QLoRA incorporates quantization techniques to further reduce memory requirements and increase the efficiency of the fine-tuning process. It allows for the deployment of large models on consumer hardware without the extensive resource demands typically associated with full fine-tuning. from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments from peft import get_peft_model, LoraConfig # Load a pre-trained model model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased") # Configure LoRA lora_config = LoraConfig( r=16, # Rank lora_alpha=32, lora_dropout=0.1, ) # Wrap the model with LoRA model = get_peft_model(model, lora_config) # Define training arguments training_args = TrainingArguments( output_dir="./results", evaluation_strategy="epoch", learning_rate=2e-5, per_device_train_batch_size=16, per_device_eval_batch_size=16, num_train_epochs=3, ) # Create a Trainer trainer = Trainer( model=model, args=training_args, train_dataset=train_dataset, eval_dataset=eval_dataset, ) # Start fine-tuning trainer.train() This code outlines how to set up a model for fine-tuning using LoRA, which can significantly reduce the resource requirements while still adapting the model effectively to specific tasks. In summary, fine-tuning with methods like LoRA and QLoRA is essential for optimizing pre-trained models for specific applications in NLP, making it feasible to deploy these powerful models in various domains efficiently. Why is Fine-Tuning Important in NLP? Task-Specific Performance: Fine-tuning helps improve performance for tasks like text classification, machine translation, or sentiment analysis in specific domains (e.g., legal, healthcare). Language-Specific Adaptation: Since models like Phi-3.5 are trained on general datasets, fine-tuning helps them handle industry-specific jargon or linguistic quirks. Efficient Resource Utilization: Instead of training a model from scratch, fine-tuning leverages pre-trained knowledge, saving computational resources and time. Steps to Fine-Tune Phi-3.5 in Azure AI Foundry Fine-tuning the Phi-3.5 model in Azure AI Foundry involves several key steps. Azure provides a user-friendly interface to streamline model customization, allowing you to quickly configure, train, and deploy models. Step 1: Setting Up the Environment in Azure AI Foundry Access Azure AI Foundry: Log in to Azure AI Foundry. If you don’t have an account, you can create one and set up a workspace. Create a New Experiment: Once in the Azure AI Foundry, create a new training experiment. Choose the Phi-3.5 model from the pre-trained models provided in the Azure Model Zoo. Set Up the Data for Fine-Tuning: Upload your custom dataset for fine-tuning. Ensure the dataset is in a compatible format (e.g., CSV, JSON). For instance, if you are fine-tuning the model for a customer service chatbot, you could upload customer queries in different languages. Step 2: Configure Fine-Tuning Settings Select the Training Dataset: Select the dataset you uploaded and link it to the Phi-3.5 model. 2) Configure the Hyperparameters: Set up training hyperparameters like the number of epochs, learning rate, and batch size. You may need to experiment with these settings to achieve optimal performance. 3) Choose the Task Type: Specify the task you are fine-tuning for, such as text classification, translation, or summarization. This helps Azure AI Foundry understand how to optimize the model during fine-tuning. 4) Fine-Tuning for Specific Languages: If you are fine-tuning for a specific language or multilingual tasks, ensure that the dataset is labeled appropriately and contains enough examples in the target language(s). This will allow Phi-3.5 to learn language-specific features effectively. Step 3: Train the Model Launch the Training Process: Once the configuration is complete, launch the training process in Azure AI Foundry. Depending on the size of your dataset and the complexity of the model, this could take some time. Monitor Training Progress: Use Azure AI Foundry’s built-in monitoring tools to track performance metrics such as loss, accuracy, and F1 score. You can view the model’s progress during training to ensure that it is learning effectively. Code Example: Fine-Tuning Phi-3.5 for a Specific Use Case Here's a code snippet for fine-tuning the Phi-3.5 model using Python and Azure AI Foundry SDK. In this example, we are fine-tuning the model for a customer support chatbot in multiple languages. from azure.ai import Foundry from azure.ai.model import Model # Initialize Azure AI Foundry foundry = Foundry() # Load the Phi-3.5 model model = Model.load("phi-3.5") # Set up the training dataset training_data = foundry.load_dataset("customer_queries_dataset") # Fine-tune the model model.fine_tune(training_data, epochs=5, learning_rate=0.001) # Save the fine-tuned model model.save("fine_tuned_phi_3.5") Best Practices for Evaluating and Validating Fine-Tuned Models Once the model is fine-tuned, it's essential to evaluate and validate its performance before deploying it in production. Split Data for Validation: Always split your dataset into training and validation sets. This ensures that the model is evaluated on unseen data to prevent overfitting. Evaluate Key Metrics: Measure performance using key metrics such as: Accuracy: The proportion of correct predictions. F1 Score: A measure of precision and recall. Confusion Matrix: Helps visualize true vs. false predictions for classification tasks. Cross-Language Validation: If the model is fine-tuned for multiple languages, test its performance across all supported languages to ensure consistency and accuracy. Test in Production-Like Environments: Before full deployment, test the fine-tuned model in a production-like environment to catch any potential issues. Continuous Monitoring and Re-Fine-Tuning: Once deployed, continuously monitor the model’s performance and re-fine-tune it periodically as new data becomes available. Deploying Phi-3.5 Model After fine-tuning the Phi-3.5 model, the next crucial step is deploying it to make it accessible for real-world applications. This section will cover two key deployment strategies: deploying in Azure for cloud-based scaling and reliability, and deploying locally with AI Toolkit for simpler offline usage. Each deployment strategy offers its own advantages depending on the use case. Deploying in Azure Azure provides a powerful environment for deploying machine learning models at scale, enabling organizations to deploy models like Phi-3.5 with high availability, scalability, and robust security features. Azure AI Foundry simplifies the entire deployment pipeline. Set Up Azure AI Foundry Workspace: Log in to Azure AI Foundry and navigate to the workspace where the Phi-3.5 model was fine-tuned. Go to the Deployments section and create a new deployment environment for the model. Choose Compute Resources: Compute Target: Select a compute target suitable for your deployment. For large-scale usage, it’s advisable to choose a GPU-based compute instance. Example: Choose an Azure Kubernetes Service (AKS) cluster for handling large-scale requests efficiently. Configure Scaling Options: Azure allows you to set up auto-scaling based on traffic. This ensures that the model can handle surges in demand without affecting performance. Model Deployment Configuration: Create an Inference Pipeline: In Azure AI Foundry, set up an inference pipeline for your model. Specify the Model: Link the fine-tuned Phi-3.5 model to the deployment pipeline. Deploy the Model: Select the option to deploy the model to the chosen compute resource. Test the Deployment: Once the model is deployed, test the endpoint by sending sample requests to verify the predictions. Configuration Steps (Compute, Resources, Scaling) During deployment, Azure AI Foundry allows you to configure essential aspects like compute type, resource allocation, and scaling options. Compute Type: Choose between CPU or GPU clusters depending on the computational intensity of the model. Resource Allocation: Define the minimum and maximum resources to be allocated for the deployment. For real-time applications, use Azure Kubernetes Service (AKS) for high availability. For batch inference, Azure Container Instances (ACI) is suitable. Auto-Scaling: Set up automatic scaling of the compute instances based on the number of requests. For example, configure the deployment to start with 1 node and scale to 10 nodes during peak usage. Cost Comparison: Phi-3.5 vs. Larger Language Models When comparing the costs of using Phi-3.5 with larger language models (LLMs), several factors come into play, including computational resources, pricing structures, and performance efficiency. Here’s a breakdown: Cost Efficiency Phi-3.5: Designed as a Small Language Model (SLM), Phi-3.5 is optimized for lower computational costs. It offers competitive performance at a fraction of the cost of larger models, making it suitable for budget-conscious projects. The smaller size (3.8 billion parameters) allows for reduced resource consumption during both training and inference. Larger Language Models (e.g., GPT-3.5): Typically require more computational resources, leading to higher operational costs. Larger models may incur additional costs for storage and processing power, especially in cloud environments. Performance vs. Cost Performance Parity: Phi-3.5 has been shown to achieve performance parity with larger models on various benchmarks, including language comprehension and reasoning tasks. This means that for many applications, Phi-3.5 can deliver similar results to larger models without the associated costs. Use Case Suitability: For simpler tasks or applications that do not require extensive factual knowledge, Phi-3.5 is often the more cost-effective choice. Larger models may still be preferred for complex tasks requiring deep contextual understanding or extensive factual recall. Pricing Structure Azure Pricing: Phi-3.5 is available through Azure with a pay-as-you-go billing model, allowing users to scale costs based on usage. Pricing details for Phi-3.5 can be found on the Azure pricing page, where users can customize options based on their needs. Code Example: API Setup and Endpoints for Live Interaction Below is a Python code snippet demonstrating how to interact with a deployed Phi-3.5 model via an API in Azure: import requests # Define the API endpoint and your API key api_url = "https://<your-azure-endpoint>/predict" api_key = "YOUR_API_KEY" # Prepare the input data input_data = { "text": "What are the benefits of renewable energy?" } # Make the API request response = requests.post(api_url, json=input_data, headers={"Authorization": f"Bearer {api_key}"}) # Print the model's response if response.status_code == 200: print("Model Response:", response.json()) else: print("Error:", response.status_code, response.text) Deploying Locally with AI Toolkit For developers who prefer to run models on their local machines, the AI Toolkit provides a convenient solution. The AI Toolkit is a lightweight platform that simplifies local deployment of AI models, allowing for offline usage, experimentation, and rapid prototyping. Deploying the Phi-3.5 model locally using the AI Toolkit is straightforward and can be used for personal projects, testing, or scenarios where cloud access is limited. Introduction to AI Toolkit The AI Toolkit is an easy-to-use platform for deploying language models locally without relying on cloud infrastructure. It supports a range of AI models and enables developers to work in a low-latency environment. Advantages of deploying locally with AI Toolkit: Offline Capability: No need for continuous internet access. Quick Experimentation: Rapid prototyping and testing without the delays of cloud deployments. Setup Guide: Installing and Running Phi-3.5 Locally Using AI Toolkit Install AI Toolkit: Go to the AI Toolkit website and download the platform for your operating system (Linux, macOS, or Windows). Install AI Toolkit by running the appropriate installation command in your terminal. Download the Phi-3.5 Model: Once AI Toolkit is installed, you can download the Phi-3.5 model locally by running: 3. Run the Model Locally: After downloading the model, start a local session by running: This will launch a local server on your machine where the model will be available for interaction. Code Example: Using Phi-3.5 Locally in a Project Below is a Python code example demonstrating how to send a query to the locally deployed Phi-3.5 model running on the AI Toolkit. import requests # Define the local endpoint local_url = "http://localhost:8000/predict" # Prepare the input data input_data = { "text": "What are the benefits of renewable energy?" } # Make the API request response = requests.post(local_url, json=input_data) # Print the model's response if response.status_code == 200: print("Model Response:", response.json()) else: print("Error:", response.status_code, response.text) Comparing Language Capabilities Test Results: How Phi-3.5 Handles Different Languages The Phi-3.5 model demonstrates robust multilingual capabilities, effectively processing and generating text in various languages. Below are comparative examples showcasing its performance in English, Spanish, and Mandarin: English Example: Input: "What are the benefits of renewable energy?" Output: "Renewable energy sources, such as solar and wind, reduce greenhouse gas emissions and promote sustainability." Spanish Example: Input: "¿Cuáles son los beneficios de la energía renovable?" Output: "Las fuentes de energía renovable, como la solar y la eólica, reducen las emisiones de gases de efecto invernadero y promueven la sostenibilidad." Mandarin Example: Input: "可再生能源的好处是什么?" Output: "可再生能源,如太阳能和风能,减少温室气体排放,促进可持续发展。" Performance Benchmarking and Evaluation Across Different Languages Benchmarking Phi-3.5 across different languages involves evaluating its accuracy, fluency, and contextual understanding. For instance, using BLEU scores and human evaluations, the model can be assessed on its translation quality and coherence in various languages. Real-World Use Case: Multilingual Customer Service Chatbot A practical application of Phi-3.5's multilingual capabilities is in developing a customer service chatbot that can interact with users in their preferred language. For instance, the chatbot could provide support in English, Spanish, and Mandarin, ensuring a wider reach and better user experience. Optimizing and Validating Phi-3.5 Model Model Performance Metrics To validate the model's performance in different scenarios, consider the following metrics: Accuracy: Measure how often the model's outputs are correct or align with expected results. Fluency: Assess the naturalness and readability of the generated text. Contextual Understanding: Evaluate how well the model understands and responds to context-specific queries. Tools to Use in Azure and Ollama for Evaluation Azure Cognitive Services: Utilize tools like Text Analytics and Translator to evaluate performance. Ollama: Use local testing environments to quickly iterate and validate model outputs. Conclusion In summary, Phi-3.5 exhibits impressive multilingual capabilities, effective deployment options, and robust performance metrics. Its ability to handle various languages makes it a versatile tool for natural language processing applications. Phi-3.5 stands out for its adaptability and performance in multilingual contexts, making it an excellent choice for future NLP projects, especially those requiring diverse language support. We encourage readers to experiment with the Phi-3.5 model using Azure AI Foundry or the AI Toolkit, explore fine-tuning techniques for their specific use cases, and share their findings with the community. For more information on optimized fine-tuning techniques, check out the Ignite Fine-Tuning Workshop. References Customize the Phi-3.5 family of models with LoRA fine-tuning in Azure Fine-tune Phi-3.5 models in Azure Fine Tuning with Azure AI Foundry and Microsoft Olive Hands on Labs and Workshop Customize a model with fine-tuning https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/fine-tuning?tabs=azure-openai%2Cturbo%2Cpython-new&pivots=programming-language-studio Microsoft AI Toolkit - AI Toolkit for VSCode1.8KViews1like2CommentsFine-Tuning Language Models with Azure AI Foundry: A Detailed Guide
What is Azure AI Foundry? Azure AI Foundry is a comprehensive platform designed to simplify the development, deployment, and management of AI models. It provides a user-friendly interface and powerful tools that enable developers to create custom AI solutions without needing extensive machine learning expertise. Key Features of Azure AI Foundry One-Button Fine-Tuning: A streamlined process that allows users to fine-tune models with minimal configuration. Integration with Development Tools: Seamless integration with popular development environments, particularly Visual Studio Code. Support for Multiple Models: Access to a variety of pre-trained models, including the Phi family of models. Understanding Fine-Tuning Fine-tuning is the process of taking a pre-trained model and adapting it to a specific dataset or task. This is particularly useful when the base model has been trained on a large corpus of general data but needs to perform well on a narrower domain. Why Fine-Tune? Improved Performance: Fine-tuning can significantly enhance the model's accuracy and relevance for specific tasks. Reduced Training Time: Starting with a pre-trained model reduces the amount of data and time required for training. Customization: Tailor the model to meet the unique needs of your application or business. One-Button Fine-Tuning in Azure AI Foundry Step-by-Step Process Select the Model: Log in to Azure AI Foundry and navigate to the model selection interface. Choose Phi-3 or another small language model from the available options. Prepare Your Data: Ensure your dataset is formatted correctly. Typically, this involves having a set of input-output pairs that the model can learn from. Upload your dataset to Azure AI Foundry. The platform supports various data formats, making it easy to integrate your existing data. Initiate Fine-Tuning: Locate the one-button fine-tuning feature within the Azure AI Foundry interface. Click the button to start the fine-tuning process. The platform will handle the configuration and setup automatically. Monitor Progress: After initiating fine-tuning, you can monitor the process through the Azure portal. The portal provides real-time updates on training metrics, allowing you to track the model's performance as it learns. Evaluate the Model: Once fine-tuning is complete, evaluate the model's performance using a validation dataset. Azure AI Foundry provides tools for assessing accuracy, precision, recall, and other relevant metrics. Deploy the Model: After successful evaluation, you can deploy the fine-tuned model directly from Azure AI Foundry. The platform supports various deployment options, including REST APIs and integration with other Azure services. Using the AI Toolkit in Visual Studio Code Overview of the AI Toolkit The AI Toolkit for Visual Studio Code enhances the development experience by providing tools specifically designed for AI model management and fine-tuning. This integration allows developers to work within a familiar environment while leveraging powerful AI capabilities. Key Features of the AI Toolkit 1) Model Management: Easily manage and switch between different models, including Phi-3 and Ollama models. 2) Data Handling: Simplified data upload and preprocessing tools to prepare datasets for training. 3) Real-Time Collaboration: Collaborate with team members in real-time, sharing insights and progress on AI projects. How to Use the AI Toolkit 1) Install the AI Toolkit: Open Visual Studio Code and navigate to the Extensions Marketplace. Search for "AI Toolkit" and install the extension. 2) Connect to Azure AI Foundry: Once installed, configure the toolkit to connect to your Azure AI Foundry account. This will allow you to access your models and datasets directly from Visual Studio Code. 3) Fine-Tune Models: Use the toolkit to initiate fine-tuning processes directly from your development environment. Monitor training progress and view logs without leaving Visual Studio Code. 4) Consume Ollama Models: The AI Toolkit supports the consumption of Ollama models, providing additional flexibility in your AI projects. This feature allows you to integrate various models seamlessly, enhancing your application's capabilities. Microsoft ONNX Live for Fine-Tuning What is Microsoft ONNX Live? Microsoft ONNX Live is a platform that allows developers to deploy and optimize AI models using the Open Neural Network Exchange (ONNX) format. ONNX is an open-source format that enables interoperability between different AI frameworks, making it easier to deploy models across various environments. Key Features of Microsoft ONNX Live Model Optimization: ONNX Live provides tools to optimize models for performance, ensuring they run efficiently in production environments. Cross-Framework Compatibility: Models trained in different frameworks (like PyTorch or TensorFlow) can be converted to ONNX format, allowing for greater flexibility in deployment. Real-Time Inference: ONNX Live supports real-time inference, enabling applications to utilize AI models for immediate predictions. Fine-Tuning with ONNX Live Model Conversion: If you have a model trained in a different framework, you can convert it to ONNX format using tools provided by Microsoft. This conversion allows you to leverage the benefits of ONNX Live for deployment and optimization. Integration with Azure AI Foundry: Once your model is in ONNX format, you can integrate it with Azure AI Foundry for fine-tuning. The one-button fine-tuning feature can be used to adapt the ONNX model to your specific dataset. Optimization Techniques: After fine-tuning, you can apply various optimization techniques available in ONNX Live to enhance the model's performance. Techniques such as quantization and pruning can significantly reduce the model size and improve inference speed. Deployment: Once optimized, the model can be deployed directly from Azure AI Foundry or ONNX Live. This deployment can be done as a REST API, allowing easy integration with web applications and services. Additional Resources To further enhance your understanding and capabilities in fine-tuning language models, consider exploring the following resources: Phi-3 Cookbook: This comprehensive guide provides insights into getting started with Phi models, including best practices for fine-tuning and deployment. Explore the Phi-3 Cookbook. Ignite Fine-Tuning Workshop: This workshop offers a hands-on approach to learning about fine-tuning techniques and tools. It includes real-world scenarios to help you understand the practical applications of fine-tuning. Visit the GitHub Repository. Conclusion Fine-tuning language models like Phi-3 using Azure AI Foundry, combined with the AI Toolkit in Visual Studio Code and Microsoft ONNX Live, provides a powerful and efficient workflow for developers. The one-button fine-tuning feature simplifies the process, while the integration with ONNX Live allows for optimization and deployment flexibility. By leveraging these tools, you can enhance your AI applications, ensuring they are tailored to meet specific needs and perform optimally in production environments. Whether you are a seasoned AI developer or just starting, Azure AI Foundry and its associated tools offer a robust ecosystem for building and deploying advanced AI solutions. References Microsoft Docs Links Fine-Tuning Models in Azure OpenAI Azure AI Services Documentation Azure Machine Learning Documentation Microsoft Learn Links Develop Generative AI Apps in Azure Fine-Tune a Language Model Azure AI Foundry Overview Get started with AI Toolkit for Visual Studio Code2.1KViews0likes0CommentsWhat's New in Microsoft Foundry Labs – May 2026
Four new releases this month — a new benchmark for how agents interact, an experimental end-to-end agentic stack, a faster image model, and a first-party geospatial model. Last month we kicked off this series with a roundup of new Foundry Labs releases across speech, vision, and multimodal AI. This month, we're back with another update — read on to see learn what's new! SocialReasoning-Bench: measuring whether AI agents act in their user's best interest We are moving into a world where agents are interacting with other agents on behalf of their users, and thus, task completion is no longer a sufficient measure of usefulness. What matters is whether the agent advocates well for the person it represents. SocialReasoning-Bench, a new open-source benchmark from Microsoft Research AI Frontiers, measures exactly that. The benchmark currently supports two main scenarios — Calendar Coordination and Marketplace Negotiation — and scores them on two new metrics: Outcome Optimality (the share of available value the agent captures for its principal) and Due Diligence (the quality of the process used, scored against a deterministic reasonable-agent policy). Together they define an operational notion of duty of care. Learn more about SocialReasoning-Bench in Foundry Labs Try it on GitHub MagenticLite, Magentic Orchestrator & Fara 1.5: an end-to-end agentic stack Microsoft Research AI Frontiers also released a complete agentic stack: MagenticLite is the application layer — the next generation of Magentic-UI, with a redesigned chat-and-browser interface and a harness rebuilt for small models. It works across both your browser and your local file system in a single workflow, with browser sessions and code execution sandboxed by Quicksand, the project's open-source QEMU runtime. Transparency is baked in: you see what the agent is reasoning about, you can take direct control at any moment, and critical actions pause for explicit approval. MagenticBrain is the orchestrator of the stack — an orchestration model fine-tuned on Qwen 3 8B that plans, codes, and delegates. Critically, it was trained end-to-end inside the MagenticLite harness with the same tool schemas it sees at inference, eliminating the gap between training and execution. Fara1.5 is the next generation of Microsoft's computer-use model family — three models (4B, 9B, 27B) on Qwen 3.5, with the 9B as the recommended flagship. Fara1.5 sets a new state of the art among small computer-use models on the Online‑Mind2Web benchmark, nearly doubling the performance of the previously released Fara‑7B, and the 27B variant records 90+% on the same benchmark 1 . Together, they represent an open-source, end-to-end agentic stack that work together, so developers can build, plan, and run agents on infrastructure they control. Learn more about MagenticLite on Foundry Labs Try it on GitHub MAI-Image-2-Efficient: high-quality image generation at speed and scale MAI-Image-2-Efficient — Image‑2e for short — is Microsoft's latest text-to-image model, built on the same architecture as MAI-Image-2 (which debuted at #3 on the Arena.ai leaderboard for image model families) but engineered for the production workloads where every millisecond and every GPU hour matters. When normalized by latency and GPU usage, Image‑2e is up to 22% faster and 4x more efficient than MAI-Image-2 — and outpaces leading text-to-image models by 40% on average 1 . In short, it delivers more output for less compute, giving teams the headroom to iterate faster without blowing through their GPU budget. That efficiency unlocks new categories of work. E-commerce platforms, media companies, and marketing teams generating thousands of images per day for targeted ads, concept art, and mood boards translate it directly into larger batches at lower GPU cost. Chatbots, creative copilots, and AI-powered design tools translate it into latency low enough for real-time interaction. The model also has a distinct visual signature — sharp, defined lines that fit illustration, animation, and attention-grabbing photoreal imagery. Learn more about MAI-Image-2-Efficient in Foundry Labs Try it in Microsoft Foundry EO/OS Object Detection: production-grade earth observation Object detection on satellite and aerial imagery has historically required months of in-house computer vision engineering — bespoke models, custom labels, fragile pipelines. EO/OS Object Detection collapses that into a managed first-party endpoint in Microsoft Foundry. Built by the team behind Planetary Computer, EO/OS Object Detection is a model that identifies and localizes objects in overhead imagery and returns bounding-box detections optimized for batch processing of large image archives. It's part of a new GeoAI category in Microsoft Foundry, opening Microsoft's geospatial intelligence stack to anyone building on satellite or aerial data. Defense and intelligence teams analyzing satellite feeds, infrastructure operators monitoring assets at scale, agriculture and energy companies tracking change across vast landscapes, and disaster response teams triaging post-event imagery can all swap a custom one-off detector for a managed endpoint that fits inside their existing Foundry stack. Put simply, the work shifts from "build the detector" to "use the detector" — and the detection signal lands faster, more consistently, and inside the same Microsoft platform their broader AI work already runs on. Learn more about EO/OS Object Detection in Foundry Labs Try EO/OS Object Detection in Microsoft Foundry What's Next Foundry Labs is where Microsoft's most ambitious AI research becomes accessible to builders and where the products you'll rely on tomorrow are taking shape today. There's plenty more in the pipeline. Explore more AI innovations on Foundry Labs Join the Microsoft Foundry Discord community to shape the future of AI together References As tested on April 13, 2026. Compared to MAI-Image-2 when normalized by latency and GPU usage. Throughput per GPU vs MAI-Image-2 on NVIDIA H100 at 1024×1024; measured with optimized batch sizes and matched latency targets. Results vary with batch size, concurrency, and latency constraints.590Views2likes0CommentsNow in Foundry: Tongyi-MAI Z-Image-Turbo, with FLUX.1-schnell and SDXL base 1.0
This week's Model Mondays edition pairs three models available through the Hugging Face collection in Microsoft Foundry: Tongyi-MAI's Z-Image-Turbo, a new designed for lower latency on a single GPU and native bilingual text rendering; Black Forest Labs' FLUX.1-schnell, a 12B rectified flow transformer distilled to 1–4 step inference and one of the most adopted open-weight image models since its 2024 release; and Stability AI's stable-diffusion-xl-base-1.0 (SDXL), a latent diffusion research model that can be used to generate and modify images based on text prompts. Models of the week Tongyi-MAI: Z-Image-Turbo Model Specs Parameters / size: 6B (BF16) Resolution: Up to 1024×1024 native Primary task: Text-to-image generation (English and Chinese) Why it's interesting (Spotlight) Scalable Single-Stream Diffusion Transformer (S3-DiT) architecture: Z-Image concatenates text tokens, visual semantic tokens, and image VAE tokens into a single unified input stream rather than running text and image through separate branches. This single-stream design can improve parameter efficiency relative to dual-stream DiT architectures at the same capacity. See the Z-Image technical report for details. 8-step inference at sub-second latency, fits in 16GB VRAM: Z-Image-Turbo is distilled with Decoupled Distribution Matching Distillation (Decoupled-DMD) and further refined with DMDR, a method that fuses DMD with reinforcement learning during post-training. The result is a model that runs 8 Number-of-Function-Evaluations (NFE) per image with no Classifier-Free Guidance (CFG)—which roughly halves the per-step compute compared to CFG-based inference. See the Decoupled-DMD and DMDR papers. Native bilingual text rendering and strong instruction adherence: Unlike most open-weight image models, which struggle with legible in-image text, Z-Image-Turbo renders complex English and Chinese text accurately which is useful for posters, signage, packaging mockups, and marketing creative. Try it Imagine you're a community programs coordinator at your city's parks department, planning a new summer event series — a "Cake Picnic in the Park" — designed to bring neighbors together over food in shared green space. The event is a few weeks out. You haven't booked bakery partners yet, so no actual cake exists, and you need marketing assets this week to start driving sign-ups: a hero image for the registration page, a flyer for community centers and libraries, social tiles for the city's channels. Use the prompt below and a photorealistic image, that can now be scaled to become additional assets like printed flyers or social images in minutes using image editing tools (or another model). Prompt: A round layered cake displayed on a white ceramic cake stand, topped with glossy fresh red cherries and smooth pastel pink buttercream frosting piped in delicate rosettes around the edge. One generous slice has been cleanly cut and removed from the front, revealing a perfect cross-section: four distinct horizontal layers alternating between soft pink sponge cake and fluffy white vanilla cream frosting. Professional bakery photography, soft natural window light from the left, shallow depth of field, marble countertop, warm and inviting atmosphere, photorealistic detail on the cake texture, cherry highlights, and frosting swirls. Black Forest Labs: FLUX.1-schnell Model Specs Parameters / size: 12B (rectified flow transformer) Resolution: Flexible up to 2 megapixels Primary task: Text-to-image generation Why it's interesting (Spotlight) Rectified flow transformer with adversarial distillation for 1–4 step inference: FLUX.1-schnell is the distilled, Apache 2.0 sibling of the FLUX.1 family. It uses a rectified flow formulation (a diffusion variant that learns straight-line probability paths between noise and data, reducing the number of solver steps needed) and is further compressed with latent adversarial diffusion distillation. The model generates high quality images in for latency-sensitive workloads. Permissive licensing for commercial use: Released under Apache 2.0, FLUX.1-schnell can be used for personal, scientific, and commercial purposes. This has driven broad adoption across product features that need an open, redistributable image backbone. Strong prompt adherence at its parameter range: At 12B parameters, FLUX.1-schnell sits between the SDXL family and frontier proprietary image models, and it remains a common reference point for evaluating open image generation prompt following—particularly for complex compositional prompts and longer captions—roughly two years after its initial release. Try it Hugging Face Spaces give developers the ability to experiment and try new models before deploying them. Test out a few prompts here: https://black-forest-labs-flux-1-schnell.hf.space then when you are ready, deploy the model in Microsoft Foundry. Stability AI: stable-diffusion-xl-base-1.0 stabilityai/stable-diffusion-xl-base-1.0 · Hugging Face Model Specs Parameters / size: 2.6B UNet (≈3.5B total with text encoders) Resolution: 1024×1024 native Primary task: Text-to-image generation Why it's interesting (Spotlight) Dual text encoder design and an ensemble-of-experts pipeline: SDXL uses two pretrained text encoders—OpenCLIP-ViT/G and CLIP-ViT/L—concatenated to capture both broad semantic alignment and finer-grained token-level cues. It can be run standalone or paired with the SDXL refiner in an ensemble-of-experts pipeline where the base model handles early denoising and the refiner specializes in the final steps. See the SDXL report for the original training and architecture details. CreativeML Open RAIL++-M licensing for managed deployments: SDXL is distributed under the CreativeML Open RAIL++-M license, which permits commercial use and downstream fine-tuning with documented use restrictions. Try it To go deeper on SDXL, take a look at Stability AI's generative-models GitHub repository, which implements the most popular diffusion frameworks for both training and inference and continues to expand with new capabilities like distillation. Getting started You can deploy open-source Hugging Face models directly in Microsoft Foundry in two ways. The first by browsing the Hugging Face collection in the Foundry model catalog and deploying to managed endpoints in just a few clicks. The second way is direct through the Hugging Face Hub, select any supported model and then choose "Deploy on Microsoft Foundry", which brings you straight into Azure. Learn how to discover models and deploy them using Microsoft Foundry documentation: Follow along the Model Mondays series and access the GitHub to stay up to date on the latest Read Hugging Face on Azure docs Learn about one-click deployments from the Hugging Face Hub on Microsoft Foundry Explore models in Microsoft Foundry359Views0likes0CommentsA New Chapter for Realtime AI: Reasoning, Translation, and Real-Time Transcription
Voice can be one of the most direct and productive interfaces for AI — enabling customer support agents that may resolve issues without a single keystroke, live multilingual communication that can take on language barriers as conversations happen, and voice assistants capable of reasoning through complex requests in real time. Developers building these experiences need models that can keep pace with increasingly demanding latency, accuracy, and language coverage requirements. Today, OpenAI’s GPT-realtime-translate, GPT‑realtime‑2 and, GPT-realtime-whisper are rolling out into Microsoft Foundry starting today — together representing a significant step forward for the realtime model lineup available to developers on the platform. GPT-realtime-translate and GPT-realtime-whisper GPT-realtime-translate and GPT-realtime-whisper together extend the realtime stack for live multilingual audio workflows. GPT-realtime-translate is built for continuous, real-time translation, producing translated output as speech unfolds without relying on segmented pipeline processing, while GPT-realtime-whisper provides low-latency streaming transcription of the original audio in parallel. Used together, they help developers support scenarios such as live events, cross-language customer experiences, captions, monitoring, and archival workflows that require both translated output and visibility into the source speech. Continuous stream processing: This new model translates live audio without segmenting or buffering allowing for more natural interactions. New translation and transcription capabilities: Translate between languages in real time and observe faster text to speech. Available via the Realtime API GPT-realtime-2 GPT‑realtime‑2 is a generational upgrade to OpenAI's speech-to-speech model, bringing internal reasoning and an expanded context window to real-time voice applications. Where previous speech to speech models responded immediately, GPT‑realtime‑2 can work through a problem before speaking — making it well suited for voice applications that need to handle complex, multi-step queries entirely in the audio layer without routing to a separate text pipeline. Native reasoning capability: The newest realtime model introduces stronger reasoning capabilities. Now the model thinks internally before responding. Adjustable reasoning effort via {reasoning.effort}: Explicitly request the level of reasoning the model uses -- minimal, low, medium, high – to save on cost and latency. Audio in, audio out: No need for an intermediary text step, conversation stays fluid and natural. Available via the Realtime API This models is coming soon to Microsoft Foundry. Since, May 6, the models have been rolling out into the model catalog. We are excited for you to explore and build with our evolving collection of frontier models. Use cases These models work independently, but they're designed to complement each other in real-world pipelines: Live multilingual events. GPT-realtime-translate enables real-time translation of live audio, producing translated speech along with a transcript in the target language. GPT‑realtime‑whisper can be used in parallel to capture a transcription of the original speech for captions, monitoring, or archival purposes. Together, they enable multilingual live streaming with both translated experiences and visibility into the source language. Global customer support. Route inbound calls through GPT-realtime-translate to translate conversations in real time and provide a translated transcript for agents. Use GPT‑realtime‑whisper alongside it to capture the original conversation as text for compliance, quality review, or analytics. Then pass the interaction to an agent built with GPT‑realtime‑2 using {reasoning.effort}: high for complex issue resolution, all within a continuous audio pipeline. International voice assistants. Build once and deploy across languages. GPT-realtime-translate enables multilingual interaction and provides translated output with a target-language transcript, while GPT‑realtime‑whisper can optionally capture the original user input as text. GPT‑realtime‑2 manages reasoning and conversational context, supporting more complex voice interactions. Pricing Model Deployment Modality Pricing per 1M tokens Input Cached Input Output GPT-realtime-2 Global Standard Audio $32.00 $0.40 $64.00 Text $4.00 $0.40 $24.00 Image $5.00 $0.50 -- GPT-realtime-translate Global Standard Audio -- -- $2.04/hour GPT-realtime-whisper Global Standard Audio -- -- $1.02/hour *Pricing for GPT-realtime-translate and GPT-realtime-whisper will be done by the hour Getting Started Looking for ways to dive in? GPT-realtime-translate, GPT-realtime-whisper, and GPT‑realtime‑2 are rolling out into Microsoft Foundry today. Explore the model catalog and start building: https://ai.azure.com5.1KViews1like5CommentsIntroducing OpenAI's GPT-image-2 in Microsoft Foundry
Take a small design team running a global social campaign. They have the creative vision to produce localized imagery for every market, but not the resources to reshoot, reformat, or outsource that scale. Every asset needs to fit a different platform, a different dimension, a different cultural context, and they all need to ship at the same time. This is where flexible image generation comes in handy. OpenAI's GPT-image-2 is now generally available and rolling out today to Microsoft Foundry, introducing a step change in image generation. Developers and designers now get more control over image output, so a small team can execute with the reach and flexibility of a much larger one. What is new in GPT-image-2? GPT-image-2 brings real world intelligence, multilingual understanding, improved instruction following, increased resolution support, and an intelligent routing layer giving developers the tools to scale image generation for production workflows. Real world intelligence GPT-image-2 has a knowledge cut off of December 2025, meaning that it is able to give you more contextually relevant and accurate outputs. The model also comes with enhanced thinking capabilities that allow it to search the web, check its own outputs, and create multiple images from just one prompt. These enhancements shift image generation models away from being simple tools and runs them into creative sidekicks. Multilingual understanding GPT-image-2 includes increased language support across Japanese, Korean, Chinese, Hindi, and Bengali, as well as new thinking capabilities. This means the model can create images and render text that feels localized. Increased resolution support GPT-image-2 introduces 4K resolution support, giving developers the ability to generate rich, detailed, and photorealistic images at custom dimensions. Resolution guidelines to keep in mind: Constraint Detail Total pixel budget Maximum pixels in final image cannot exceed 8,294,400 Minimum pixels in final image cannot be less than 655,360 Requests exceeding this are automatically resized to fit. Resolutions 4K, 1024x1024, 1536x1024, and 1024x1536 Dimension alignment Each dimension must be a multiple of 16 Note: If your requested resolution exceeds the pixel budget, the service will automatically resize it down. Intelligent routing layer GPT-image-2 also includes an expanded routing layer with two distinct modes, allowing the service to intelligently select the right generation configuration for a request without requiring an explicitly set size value. Mode 1 — Legacy size selection In Mode 1, the routing layer selects one of the three legacy size tiers to use for generation: Size tier Description smimage Small image output image Standard image output xlimage Large image output This mode is useful for teams already familiar with the legacy size tiers who want to benefit from automatic selection without making any manual changes. Mode 2 — Token size bucket selection In Mode 2, the routing layer selects from six token size buckets — 16, 24, 36, 48, 64, 96 — which map roughly to the legacy size tiers: Token bucket Approximate legacy size 16, 24 smimage 36, 48 image 64, 96 xlimage This approach can allow for more flexibility in the number of tokens generated, which in turn helps to better optimize output quality and efficiency for a given prompt. See it in action GPT-image-2 shows improved image fidelity across visual styles, generating more detailed and refined images. But, don’t just take our word for it, let's see the model in action with a few prompts and edits. Here is the example we used: Prompt: Interior of an empty subway car (no people). Wide-angle view looking down the aisle. Clean, modern subway car with seats, poles, route map strip, and ad frames above the windows. Realistic lighting with a slight cool fluorescent tone, realistic materials (metal poles, vinyl seats, textured floor). As you can see, when using the same base prompt, the image quality and realism improved with each model. Now let’s take a look at adding incremental changes to the same image: Prompt: Populate the ad frames with a cohesive ad campaign for “Zava Flower Delivery” and use an array of flower types. And our subway is now full of ads for the new ZAVA flower delivery service. Let's ask for another small change: Prompt: In all Zava Flower Delivery advertisements, change the flowers shown to roses (red and pink roses). And in three simple prompts, we've created a mockup of a flower delivery ad. From marketing material to website creation to UX design, GPT-image-2 now allows developers to deliver production-grade assets for real business use cases. Image generation across industries These new capabilities open the door to richer, more production-ready image generation workflows across a range of enterprise scenarios: Retail & e-commerce: Generate product imagery at exact platform-required dimensions, from square thumbnails to wide banners, without post-processing. Marketing: Produce crisp, rich in color campaign visuals and social assets localized to different markets. Media & entertainment: Generate storyboard panels and scene at resolutions suited to production pipelines. Education & training: Create visual learning aids and course materials formatted to exact display requirements across devices. UI/UX design: Accelerate mockup and prototype workflows by generating interface assets at the precise dimensions your design system requires. Trust and safety At Microsoft, our mission to empower people and organizations remains constant. As part of this commitment, models made available through Foundry undergo internal reviews and are deployed with safeguards designed to support responsible use at scale. Learn more about responsible AI at Microsoft. For GPT-image-2, Microsoft applied an in-depth safety approach that addresses disallowed content and misuse while maintaining human oversight. The deployment combines OpenAI’s image generation safety mitigations with Azure AI Content Safety, including filters and classifiers for sensitive content. Pricing Model Offer type Pricing - Image Pricing - Text GPT-image-2 Standard Global Input Tokens: $8 Cached Input Tokens: $2 Output Tokens: $30 Input Tokens: $5 Cached Input Tokens: $1.25 Note: All prices are per 1M token. There is no billing for output tokens for the GPT-image-2 model. Getting started Whether you’re building a personalized retail experience, automating visual content pipelines or accelerating design workflows. GPT-image-2 gives your team the resolution control and intelligent routing to generate images that fit your exact needs. Try the GPT-image-2 in Microsoft Foundry today! Deploy the model in Microsoft Foundry Experiment with the model in the Image playground Read the documentation to learn more15KViews3likes3Comments