azure openai
87 Topics
AMA on Camera - AI Revolution: Azure OpenAI's Game-Changing Enhancements
Discover the latest enhancements to Azure OpenAI Service offerings and deployments. Join our engineering experts to learn about Azure OpenAI Data Zones for flexible, multi-regional data processing and compliance. Explore simplified deployment flows, industry-leading performance guarantees, and cost-efficiency at scale. Perfect for anyone looking to leverage adaptable, reliable, and scalable AI solutions for enterprise applications.
Please check out this blog post for more information: Accelerate scale with Azure OpenAI Service Provisioned offering | Microsoft Azure Blog
The Azure OpenAI Global Batch offering is designed to handle large-scale, high-volume processing tasks efficiently. Process asynchronous groups of requests with separate quota and a 24-hour turnaround time, at 50% less cost than Global Standard.
There will be presentations on Batch, Data Zone Standard, and Data Zone Provisioned, and we will be taking questions throughout. Please share any questions you would like our experts to address in the comments below. You can post questions ahead of time if that fits your schedule or time zone better, though they will not be answered until the live hour. Questions will be answered both on video in the live broadcast and in text below. This will be a live stream video directly on this event page.
NOTE: The link sent via private message will not work due to some platform changes. We are working on redirecting it, but just in case, the new event link (the page you are on) is here: AMA on Camera - AI Revolution: Azure OpenAI's Game-Changing Enhancements | Microsoft Community Hub
5K Views | 13 likes | 11 Comments
Azure Data Explorer for Vector Similarity Search
https://learn.microsoft.com/en-us/azure/data-explorer/kusto/query/series-cosine-similarity-function
In the world of AI and data analytics, vector databases are emerging as a powerful tool for managing complex, high-dimensional data. In this article, we explore the concept of vector databases, why data analytics needs them, and how Azure Data Explorer (ADX), also known as Kusto, can be used as a vector database.
32K Views | 13 likes | 5 Comments
Stop Drawing Architecture Diagrams Manually: Meet the Open-Source AI Architecture Review Agents
Designing and documenting software architecture is often a battle against static diagrams that become outdated the moment they are drawn. The Architecture Review Agent changes that by turning your design process into a dynamic, AI-powered workflow. In this post, we explore how to leverage Microsoft Foundry Hosted Agents, Azure OpenAI, and Excalidraw to build an open-source tool that instantly converts messy text descriptions, YAML, or README files into editable architecture diagrams. Beyond just drawing boxes, the agent acts as a technical co-pilot, delivering prioritized risk assessments, highlighting single points of failure, and mapping component dependencies. Discover how to eliminate manual diagramming, catch security flaws early, and deploy your own enterprise-grade agent with zero infrastructure overhead.
16K Views | 7 likes | 5 Comments
Building a Scalable Web Crawling and Indexing Pipeline with Azure storage and AI Search
In the ever-evolving world of data management, keeping search indexes up to date with dynamic data can be challenging. Traditional approaches, such as manual or scheduled indexing, are resource-intensive, delay-prone, and difficult to scale. Azure Blob Trigger combined with an AI Search Indexer offers a solution to these challenges, enabling real-time, scalable, and enriched data indexing. This blog explores how Blob Trigger, integrated with Azure Cognitive Search, transforms the indexing process by automating workflows and enriching data with AI capabilities. It highlights the step-by-step process of configuring Blob Storage, creating Azure Functions for triggers, and connecting with an AI-powered search index. The approach leverages Azure's event-driven architecture, ensuring efficient and cost-effective data management.
3.8K Views | 7 likes | 10 Comments
Three tiers of Agentic AI - and when to use none of them
Every enterprise has an AI agent. Almost none of them work in production.
Walk into any enterprise technology review right now and you will find the same thing. Pilots running. Demos recorded. Steering committees impressed. And somewhere in the background, a quiet acknowledgment that the thing does not actually work at scale yet.
OutSystems surveyed nearly 1,900 global IT leaders and found that 96% of organizations are already running AI agents in some capacity. Yet only one in nine has those agents operating in production at scale. The experiments are everywhere. The production systems are not.
That gap is not a capability problem. The infrastructure has matured. Tool calling is standard across all major models. Frameworks like LangGraph, CrewAI, and Microsoft Agent Framework abstract orchestration logic. Model Context Protocol standardizes how agents access external tools and data sources. Google's Agent-to-Agent protocol, now under Linux Foundation governance with over 50 enterprise technology partners including Salesforce, SAP, ServiceNow, and Workday, standardizes how agents coordinate with each other. The protocols are in place. The frameworks are production ready.
The gap is a selection and governance problem. Teams are building agents on problems that do not need them. Choosing the wrong tier for the ones that do. And treating governance as a compliance checkbox to add after launch, rather than an architectural input to design in from the start.
The same OutSystems research found that 94% of organizations are concerned that AI sprawl is increasing complexity, technical debt, and security risk, and only 12% have a centralized approach to managing it. Teams are deploying agents the way shadow IT spread through enterprises a decade ago: fast, fragmented, and without a shared definition of what production-ready actually means.
I've built agentic systems across enterprise clients in logistics, retail, and B2B services.
The failures I keep seeing are not technology failures. They are architecture and judgment failures: problems that existed before the first line of code was written, in the conversation where nobody asked the prior question. This article is the framework I use before any platform conversation starts.
What has genuinely shifted in the agentic landscape
Three changes are shaping how enterprise agent architecture should be designed today, and they are not incremental improvements on what existed before.
The first is the move from single agents to multi-agent systems. Databricks' State of AI Agents report, drawing on data from over 20,000 organizations, including more than 60% of the Fortune 500, found that multi-agent workflows on their platform grew 327% in just four months. This is not experimentation. It is production architecture shifting. A single agent handling everything (routing, retrieval, reasoning, execution) is being replaced by specialized agents coordinating through defined interfaces. A financial organization, for example, might run separate agents for intent classification, document retrieval, and compliance checking, each narrow in scope, each connected to the next through a standardized protocol rather than tightly coupled code.
The second is protocol standardization. MCP handles vertical connectivity: how agents access tools, data sources, and APIs through a typed manifest and standardized invocation pattern. A2A handles horizontal connectivity: how agents discover peer agents, delegate subtasks, and coordinate workflows. Production systems today use both. The practical consequence is that multi-agent architectures can be composed and governed as a platform rather than managed as a collection of one-off integrations.
The third is governance as the differentiating factor between teams that ship and teams that stall. Databricks found that companies using AI governance tools get over 12 times more AI projects into production compared to those without.
The teams running production agents are not running more sophisticated models. They built evaluation pipelines, audit trails, and human oversight gates before scaling, not after the first incident.
Tier 1 - Low-code agents: fast delivery with a defined ceiling
The low-code tier is more capable than it was eighteen months ago. Copilot Studio, Salesforce Agentforce, and equivalent platforms now support richer connector libraries, better generative orchestration, and more flexible topic models. The ceiling is higher than it was. It is still a ceiling.
The core pattern remains: a visual topic model drives a platform-managed LLM that classifies intent and routes to named execution branches. Connectors abstract credential management and API surface. A business team (analyst, citizen developer, IT operations) can build, deploy, and iterate without engineering involvement on every change. For bounded conversational problems, this is the fastest path from requirement to production.
The production reality is documented clearly. Gartner data found that only 5% of Copilot Studio pilots moved to larger-scale deployment. A European telecom with dedicated IT resources and a full Microsoft enterprise agreement spent six months and did not deliver a single production agent. The visual builder works. The path from prototype to production (production-grade integrations, error handling, compliance logging, exception routing) is where most enterprises get stuck, because it requires Power Platform expertise that most business teams do not have.
The platform ceiling shows up predictably at four points.
Async processing: anything beyond a synchronous connector call, including approval chains, document pipelines, or batch operations, cannot be handled natively.
Full payload audit logs: platform logs give conversation transcripts and connector summaries, not structured records of every API call and its parameters.
Production volume: concurrency limits and message throughput budgets bind faster than planning assumptions suggest.
Root cause analysis in production: you cannot inspect the LLM's confidence score or the alternatives it considered, which makes diagnosing misbehavior significantly harder than it should be.
The correct diagnostic: can this use case be owned end-to-end by a business team, covered by standard connectors, with no latency SLA below three seconds and no payload-level compliance requirement? If yes, low-code is the correct tier, not a compromise. If no on any point, continue.
If low-code is the right call for your use case: Copilot Studio quickstart
Tier 2 - Pro-code agents: the architecture the current landscape demands
The defining pattern in production pro-code architecture today is multi-agent. Specialized agents per domain, coordinating through MCP for tool access and A2A for peer-to-peer delegation, with a governance layer spanning the entire system.
What this looks like in practice: a financial organization handling incoming compliance queries runs separate agents for intent classification, document retrieval, and the compliance check itself. None of these agents tries to do all three jobs. Each has a narrow responsibility, a defined input/output contract typed against a JSON Schema, and a clear handoff boundary. The 327% growth in multi-agent workflows reflects production teams discovering that the failure modes of monolithic agents (topic collision, context overflow, degraded classification as scope expands) are solved by specialization, not by making a single agent more capable.
The discipline that makes multi-agent systems reliable is identical to what makes single-agent systems reliable, just enforced across more boundaries: the LLM layer reasons and coordinates; deterministic tool functions enforce. In a compliance pipeline, no LLM decides whether a document satisfies a regulatory requirement.
That evaluation runs in a deterministic tool with a versioned rule set, testable outputs, and an immutable audit log. The LLM orchestrates the sequence. The tool produces the compliance record. Mixing these (letting an LLM evaluate whether a rule passes) collapses the audit trail and introduces probabilistic outputs on questions that have regulatory answers.
MCP is the tool interface standard today. An MCP server exposes a typed manifest any compliant agent runtime can discover at startup. Tools are versioned, independently deployable, and reusable across agents without bespoke integration code. A2A extends this horizontally: agents advertise capability cards, discover peers, and delegate subtasks through a standardised protocol. The practical consequence is that multi-agent systems built on both protocols can be composed and governed as a platform rather than managed as a collection of one-off integrations.
Observability is the architectural element that separates teams shipping production agents from teams perpetually in pilot. Build evaluation pipelines, distributed traces across all agent boundaries, and human review gates before scaling. The teams that add these after the first production incident spend months retrofitting what should have been designed in.
If pro-code is the right call for your use case: Foundry Agent Service
The hybrid pattern: still where production deployments land
The shift to multi-agent architecture does not change the hybrid pattern; it deepens it. Low-code at the conversational surface, pro-code multi-agent systems behind it, with a governance layer spanning both.
On a logistics client engagement, the brief was a sales assistant for account managers: shipment status, account health, and competitive context inside Teams. The business team wanted everything in Copilot Studio. Engineering wanted a custom agent runtime. Both were wrong.
What we built: Copilot Studio handled all high-frequency, low-complexity queries (shipment tracking, account status, open cases) through Power Platform connectors. Zero custom code. That covered roughly 78% of actual interaction volume.
Requests requiring multi-source reasoning (competitive positioning on a specific lane, churn risk across an account portfolio, contract renewal analysis) were delegated via authenticated HTTP action to a pro-code multi-agent service on Azure. A retrieval agent pulled deal history and market intelligence through MCP-exposed tools. A synthesis agent composed the recommendation with confidence scoring. Structured JSON went back to the low-code layer, rendered as an adaptive card in Teams.
The HITL gate was non-negotiable and designed before deployment, not added after the first incident. No output reached a customer without a manager approval step. The agent drafts. A human sends.
This boundary (low-code for conversational volume, pro-code for reasoning depth) maps directly to what the research shows separates teams that ship from teams that stall. The organizations running agents in production drew the line correctly between what the platform can own and what engineering needs to own. Then they built governance into both sides before scaling.
The four gates - the prior question that still gets skipped
Run every candidate use case through these four checks before the platform conversation begins. None of the recent infrastructure improvements change what they are checking, because none of them change the fundamental cost structure of agentic reasoning.
Gate 1 - is the logic fully deterministic? If every valid output for every valid input can be enumerated in unit tests, the problem does not need an LLM. A rules engine executes in microseconds at zero inference cost and cannot produce a plausible-but-wrong answer.
NeuBird AI's production ops agents, which have resolved over a million alerts and saved enterprises over $2 million in engineering hours, work because alert triage logic that can be expressed as rules runs in deterministic code, and the LLM only handles cases where pattern-matching is insufficient. That boundary is not incidental to the system's reliability. It is the reason for it.
Gate 2 - is zero hallucination tolerance required? With over 80% of databases now being built by AI agents, per Databricks' State of AI Agents report, the surface area for hallucination-induced data errors has grown significantly. In domains where a wrong answer is a compliance event (financial calculation, medical logic, regulatory determinations), irreducible LLM output uncertainty is disqualifying regardless of model version or prompt engineering effort. Exit to deterministic code or classical ML with bounded output spaces.
Gate 3 - is a sub-100ms latency SLA required? LLM inference is faster than it was eighteen months ago. It is not fast enough for payment transaction processing, real-time fraud scoring, or live inventory management. A three-agent system with MCP tool calls has a P50 latency measured in seconds. These problems need purpose-built transactional architecture.
Gate 4 - is regulatory explainability required? A2A enables complex agent coordination and delegation. It does not make LLM reasoning reproducible in a regulatory sense. Temperature above zero means the same input can produce different outputs across invocations. Regulators in financial services, healthcare, and consumer credit require deterministic, auditable decision rationale. Exit to deterministic workflow with structured audit logging at every
Five production failure modes - one of them new
The four original anti-patterns are still showing up in production. A fifth has been added by scale.
Routing data retrieval through a reasoning loop. A direct API call returns account status in under 10ms.
Routing the same request through an LLM reasoning step adds hundreds of milliseconds, consumes tokens on every call, and introduces output parsing on data that is already structured. The agent calls a structured tool. The tool calls the API. The agent never acts as the integration layer.
Encoding business rules in prompts. Rules expressed in prompt text drift as models update. They produce probabilistic output across invocations and fail in ways that are difficult to reproduce and diagnose. A rule that must evaluate correctly every time belongs in a deterministic tool function: unit-tested, version-controlled, and independently deployable via MCP.
No approval gate on CRUD operations. CRUD operations without a human approval step will eventually misfire on the input that testing did not cover. The gate needs to be designed before deployment, not added after the first incident involving a financial posting, a customer-facing communication, or a data deletion.
Monolithic agent for all domains. A single agent accumulating every domain leads predictably to topic collision, context overflow, and maintenance that becomes impossible as scope expands. Specialized agents per domain, coordinating through A2A, is the architecture that scales.
Ungoverned agent sprawl. This is the new one and currently the most prevalent. OutSystems found 94% of organizations concerned about it, with only 12% having a centralized response. Teams building agents independently across fragmented stacks, without shared governance, evaluation standards, or audit infrastructure, produce exactly the same organizational debt that shadow IT created, but with higher stakes, because these systems make autonomous decisions rather than just storing and retrieving data. The fix is treating governance as an architectural input before deployment, not a compliance requirement after something breaks.
The infrastructure is ready. The judgment is not.
The tier decision sequence has not changed.
Does the problem need natural language understanding or dynamic generation? No: deterministic system, stop.
Can a business team own it through standard connectors with no sub-3-second latency SLA and no payload-level compliance requirement? Yes: low-code.
Does it need custom orchestration, multi-agent coordination, or audit-grade observability? Yes: pro-code with MCP and A2A.
Does it need both a conversational surface and deep backend reasoning? Hybrid, with a governance layer spanning both.
What has changed is that governance is no longer optional infrastructure to add when you have time. The data is unambiguous. Companies with governance tools get over 12 times more AI projects into production than those without. Evaluation pipelines, distributed tracing across agent boundaries, human oversight gates, and centralised agent lifecycle management are not overhead. They are what converts experiments into production systems. The teams still stuck in pilot are not stuck because the technology failed them. They are stuck because they skipped this layer.
The protocols are standardised. The frameworks are mature. The infrastructure exists. None of that is what is holding most enterprise agent programmes back. What is holding them back is a selection problem disguised as a technology problem: teams building agents before asking whether agents are warranted, choosing platforms before running the four gates, and treating governance as a checkpoint rather than an architectural input.
I have built agents that should have been workflow engines. Not because the technology was wrong, but because nobody stopped early enough to ask whether it was necessary. The four gates in this article exist because I learned those lessons at clients' expense, not mine.
The most useful thing I can offer any team starting an agentic AI project is not a framework selection guide. It is permission to say no, and a clear basis for saying it.
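The four gates and the tier decision sequence above can be read as a short decision function. The following is a minimal sketch, not a definitive implementation: the `UseCase` field names, the `Tier` labels, and the choice to check the hybrid case before falling through to pro-code are all assumptions made for illustration. Each field stands in for one of the yes/no questions in the text.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    NO_AGENT = "no agent: deterministic or purpose-built system"
    LOW_CODE = "low-code"
    PRO_CODE = "pro-code with MCP and A2A"
    HYBRID = "hybrid with a governance layer spanning both"


@dataclass
class UseCase:
    # Hypothetical field names; each mirrors one question from the article.
    needs_language_or_generation: bool
    fully_deterministic: bool = False              # Gate 1
    zero_hallucination_tolerance: bool = False     # Gate 2
    needs_sub_100ms_latency: bool = False          # Gate 3
    needs_regulatory_explainability: bool = False  # Gate 4
    business_team_can_own: bool = False
    standard_connectors_suffice: bool = False
    needs_sub_3s_latency: bool = False
    needs_payload_level_compliance: bool = False
    needs_conversational_surface: bool = False
    needs_deep_backend_reasoning: bool = False


def select_tier(uc: UseCase) -> Tier:
    # Any gate that answers "yes" exits before a platform is considered.
    gate_exit = (
        uc.fully_deterministic
        or uc.zero_hallucination_tolerance
        or uc.needs_sub_100ms_latency
        or uc.needs_regulatory_explainability
    )
    if not uc.needs_language_or_generation or gate_exit:
        return Tier.NO_AGENT

    # Low-code only when every condition of the Tier 1 diagnostic holds.
    low_code_ok = (
        uc.business_team_can_own
        and uc.standard_connectors_suffice
        and not uc.needs_sub_3s_latency
        and not uc.needs_payload_level_compliance
    )
    if low_code_ok:
        return Tier.LOW_CODE

    # Assumed ordering: hybrid is checked before plain pro-code.
    if uc.needs_conversational_surface and uc.needs_deep_backend_reasoning:
        return Tier.HYBRID
    return Tier.PRO_CODE
```

Writing the questions down as explicit fields makes the checklist reviewable: a team can record the answers for each pilot and see which gate sent it out of scope.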
Take the four gates framework to your next architecture review. If you have already shipped agents to production, I would like to hear what worked and what did not. Comment below.
What to do next
Three concrete steps depending on where you are right now.
If you have pilots that have not reached production: Run them through the four gates in this article before the next sprint. Gate 1 alone will eliminate a meaningful percentage of them. The ones that survive all four are your real candidates for production investment. Download the attached gated checklist and take it into your next architecture review.
If you are starting a new agent project: Do not open a platform before you have answered the gate questions. Once you have confirmed an agent is warranted and identified the tier, start here: Copilot Studio guided setup for low-code scenarios, or Foundry Agent Service for pro-code patterns with MCP and multi-agent coordination built in. Build governance infrastructure (evaluation pipeline, distributed tracing, HITL gates) before you scale, not after.
If you have already shipped agents to production: Share what worked and what did not in the Azure AI Tech Community; tag posts with #AgentArchitecture. The most useful signal for teams still in pilot is hearing from practitioners who have been through production, not vendors describing what production should look like.
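The HITL gate described earlier ("The agent drafts. A human sends.") can be sketched in a few lines. This is an illustration under stated assumptions, not the system from the engagement: `Draft`, `ApprovalGate`, and the `send` callable are hypothetical names, and a real deployment would persist the audit log rather than hold it in memory.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List


class Status(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"


@dataclass
class Draft:
    recipient: str
    body: str
    status: Status = Status.PENDING


@dataclass
class ApprovalGate:
    # `send` is a hypothetical delivery function supplied by the caller.
    send: Callable[[Draft], None]
    audit_log: List[str] = field(default_factory=list)

    def submit(self, draft: Draft) -> Draft:
        # The agent can only create drafts; nothing is delivered here.
        self.audit_log.append(f"drafted for {draft.recipient}")
        return draft

    def review(self, draft: Draft, approver: str, approve: bool) -> None:
        # Delivery happens only as a direct result of a human decision.
        if draft.status is not Status.PENDING:
            raise ValueError("draft already reviewed")
        draft.status = Status.APPROVED if approve else Status.REJECTED
        self.audit_log.append(f"{draft.status.value} by {approver}")
        if approve:
            self.send(draft)
```

The design choice worth noting is that the agent-facing method (`submit`) has no code path to `send`; the only route to delivery runs through `review`.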
References
OutSystems. State of AI Development Report. https://www.outsystems.com/1/state-ai-development-report
Databricks. State of AI Agents Report. https://www.databricks.com/resources/ebook/state-of-ai-agents
Gartner. 2025 Microsoft 365 and Copilot Survey. https://www.gartner.com/en/documents/6548002 (Paywalled primary source; publicly reported via techpartner.news: https://www.techpartner.news/news/gartner-microsoft-copilot-hype-offset-by-roi-and-readiness-realities-618118)
Anthropic. Model Context Protocol (MCP). https://modelcontextprotocol.io
Google Cloud. Agent-to-Agent Protocol (A2A). https://developers.googleblog.com/en/a2a-a-new-era-of-agent-interoperability
NeuBird AI. Production Operations Deployment Announcement: NeuBird AI Closes $19.3M Funding Round to Scale Agentic AI Across Enterprise Production Operations
Yao et al. ReAct: Synergizing Reasoning and Acting in Language Models. https://arxiv.org/abs/2210.03629
Gregor Hohpe & Bobby Woolf. Enterprise Integration Patterns. Addison-Wesley. https://www.enterpriseintegrationpatterns.com
4.1K Views | 4 likes | 1 Comment
GPT Capability in Understanding Coordinates: How GPT-5.4 Transforms Spatial Precision
Why I Ran This Experiment
This work started not as a benchmarking exercise but as a practical problem: I needed to automatically extract panel regions from PDF-format electrical Single-Line Diagram (SLD) drawings using OpenAI models. All experiments were conducted using OpenAI models in Microsoft Foundry, Microsoft's unified platform for building generative AI applications. The downstream goal was a pipeline that combines a GPT model with Azure Document Intelligence to generate Bills of Materials (BOMs), a project I wrote about separately in Extracting BOMs from Electrical Drawings with AI: Azure OpenAI GPT-5 + Azure Document Intelligence Pipeline.
Before building that pipeline, I needed a clear-eyed answer to a deceptively simple question: how well can GPT actually understand and return pixel-level coordinates from an image? If the model can't reliably locate a panel bounding box, the rest of the pipeline doesn't matter.
When I first ran these tests against GPT-5.2, the results were mixed: good enough to be promising, but inconsistent enough to leave clear room for improvement. I tried many workarounds: feeding image dimensions explicitly, overlaying coordinate grids, enabling extended reasoning, and building iterative self-correction loops. Each helped, but required deliberate engineering effort.
Then GPT-5.4 was released. Re-running the same benchmark revealed that most of those workarounds were no longer necessary.
Context: All experiments use a fixed CAD-style test image (847 × 783 px) with a known ground-truth bounding box at [135, 165, 687, 619]. Accuracy is measured by Intersection over Union (IoU); a score of 1.0 is a perfect match. Every coordinate experiment was run 5 times and averaged.
The Experiment Design
I designed experiments across two axes: prompt strategy (how spatial information is presented to the model) and reasoning mode (standard vs. extended reasoning).
Each combination was tested on both GPT-5.2 and GPT-5.4 under two reasoning modes (None vs. High), giving four conditions per test.
Single-Shot Strategies (Tests 1–5)
These tests have no iterative validation loop: the model gets one prompt and returns its answer. Each test was run 5 times and the results averaged, so the scores reflect consistency, not a single lucky attempt. The differences between tests lie in how spatial information is framed in the prompt.
Test 1 is a simple sanity check: can the model understand percentage-based coordinates at all? The model receives the clean image (no overlay) and is asked: "return the pixel coordinate at 30% width, 50% height." The expected answer is (254, 392). GPT-5.2 gets the X coordinate roughly right (~254–260), but the Y coordinate scatters wildly: predictions range from 260 to 322, roughly 70 to 130 pixels above the correct position. GPT-5.4 returns (254, 392) on every single run.
Even on this simple sanity check, the gap is stark: GPT-5.4 is pixel-perfect from the start, while GPT-5.2 shows a clear Y-axis bias. But a single-point test doesn't tell us how well the models handle real spatial tasks. The next question: can they detect a full bounding box?
Tests 2–5: Bounding Box Detection with Increasing Prompt Richness
Tests 2–5 move to the real task: detecting a bounding box drawn on the image. Each test sends a different version of the same base image, with progressively richer spatial context in the prompt.
Feedback Loop Strategies (Tests 6A–7B)
These tests add an iterative validation loop: the model's predicted bounding box is overlaid on the image and sent back for self-correction, for up to 5 iterations (early stop at IoU ≥ 0.99). All feedback tests share the same two-phase structure: an init step (first prediction) and a validation loop (iterative correction).
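To make the metric and the loop structure concrete, here is a minimal sketch of the IoU computation and the two-phase feedback loop. Several things are assumptions for illustration: the box format is taken to be [x1, y1, x2, y2], the `init` and `correct` callables are hypothetical stand-ins for the model calls (overlay rendering and prompting are not shown), and counting the 5 iterations as correction rounds after the init step is my reading of the setup.

```python
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float]  # assumed [x1, y1, x2, y2] in pixels

GROUND_TRUTH: Box = (135, 165, 687, 619)  # from the benchmark description


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def run_feedback_loop(
    init: Callable[[], Box],
    correct: Callable[[Box, List[Box]], Box],
    ground_truth: Box,
    max_iters: int = 5,
    stop_at: float = 0.99,
) -> Tuple[Box, List[Box]]:
    """Init step, then up to max_iters corrections with early stop."""
    box = init()                      # phase 1: first prediction
    history = [box]
    best = box
    for _ in range(max_iters):        # phase 2: validation loop
        if iou(best, ground_truth) >= stop_at:
            break                     # early stop once the box is good enough
        box = correct(box, history)   # hypothetical model call with feedback
        history.append(box)
        if iou(box, ground_truth) > iou(best, ground_truth):
            best = box
    return best, history
```

Passing `history` into `correct` covers both variants described in the tests: a holistic corrector can ignore it, while a history-aware one can use the full list of past predictions.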
All feedback tests use the same two images (init + validation overlay), but differ in prompt strategy and color assignment. Image-wise, they fall into two groups:
Group A: Orange GT (Tests 6A, 6C, 7A)
Group B: Color Bias / Blue GT (Tests 6B, 6D, 7B)
Figure 4b: Feedback loop input images. Group B (bottom): colors swapped to test color-role priors.
What differs between tests in the same group: The images are identical, but the prompt changes. 6A/6B use holistic comparison ("compare and correct"). 6C/6D additionally send the full history of past predictions as multi-image input. 7A/7B ask for per-edge directional judgments ("move left/right/up/down/none" for each edge independently).
Results
1. Model version is the single biggest factor
Across every test, GPT-5.4 dramatically outperforms GPT-5.2. The gap is not incremental: it's the difference between a bounding box that roughly overlaps the target and one that is essentially pixel-perfect. GPT-5.4 achieved an IoU of 0.99 or above on its very first attempt on tests where GPT-5.2 had only scored between 0.76 and 0.88.
GPT-5.4 (green bars) consistently hits ≥0.99 regardless of prompt strategy or reasoning mode. GPT-5.2 (blue bars) ranges from 0.76 to 0.92.
2. GPT-5.2 is inconsistent; GPT-5.4 locks in
Raw averages only tell half the story. GPT-5.2 is unpredictable: on the exact same test with the exact same prompt and image, results fluctuate wildly between runs. The standard deviation on Test 2 is ±0.084, and individual runs landed anywhere from 0.66 to 0.88. GPT-5.4 stays within ±0.003.
The scatter plots below make this clear. Each dot is one API call. Notice how GPT-5.2 dots spray across the IoU range while GPT-5.4 dots stack on top of each other:
Wide scatter on simpler prompts (Test 2: 0.66–0.88); reasoning mode (orange) provides a lift that shrinks with richer prompts (Δmean shown below each panel).
Production implication: With GPT-5.2, you couldn't rely on a single inference call. Building a reliable pipeline would require multiple calls and majority voting, multiplying latency and cost. With GPT-5.4, a single call is sufficient.
3. Reasoning mode reduced variance for GPT-5.2; GPT-5.4 didn't need it
For GPT-5.2, enabling extended reasoning (reasoning: high) provided a meaningful boost, especially when the prompt was sparse. On Test 2 (bare image, no spatial context), reasoning added +0.076 IoU and visibly tightened the spread of results across runs. As prompts got richer, the benefit shrank: with a grid overlay (Test 4), reasoning added only +0.007. In other words, reasoning mode acted as a compensating mechanism, filling in the gaps when the prompt alone didn't provide enough spatial scaffolding.
For GPT-5.4, reasoning mode offered no additional benefit on this class of task. The base model already achieves 0.99+ IoU, so there was simply no room for improvement. In a few cases the reasoning runs showed marginal regressions (−0.005 to −0.015), likely within noise. The takeaway isn't that reasoning mode is harmful in general, but rather that a spatial-coordinate task at this complexity level doesn't require it when the underlying model already has strong coordinate understanding.
Figure 7: Effect of reasoning mode. GPT-5.2 gains +0.04–0.08 from reasoning (blue bars), largest on sparse prompts. GPT-5.4 shows no meaningful gain (green bars near zero).
4. Richer prompts close the gap (but only for GPT-5.2)
For GPT-5.2, providing more spatial context in the prompt made a big difference: from 0.765 (Test 2, no info) to 0.910 (Test 4, grid overlay), a +0.145 IoU gain just from adding visual reference rulers to the image. Telling the model the image dimensions (Test 3) was a "free win" that cost nothing.
For GPT-5.4, all prompt variants produce essentially the same result (0.989–0.997).
The model already understands spatial coordinates well enough that extra scaffolding adds no value.

GPT-5.4 is flat at ≥0.99 regardless.

If you're still on GPT-5.2:
Always inject image dimensions into the prompt (free).
Use grid overlays for the biggest single-shot gain (+0.145 IoU).
With GPT-5.4, none of this is needed.

5. Validation loops: essential for GPT-5.2, optional for GPT-5.4

The feedback loop tests (6A–7B) showed that iterative self-correction genuinely helped GPT-5.2 improve from its initial prediction. For example, in Test 7A (directional feedback), GPT-5.2 improved from an init IoU of 0.926 to a best of 0.969 over 5 iterations.

For GPT-5.4, every single run hit IoU ≥ 0.99 on iteration 1 and early-stopped immediately. There was nothing left to correct. The validation loop infrastructure (overlay rendering, multi-turn prompting, iteration logic) becomes dead code you can remove from your pipeline.

GPT-5.4 (green) starts at ≥0.99 and early-stops at iteration 1.

6. Prompt instruction matters: holistic vs directional feedback

Comparing 6A/6B (holistic: "compare the two boxes and correct") with 7A/7B (directional: "for each edge, decide which direction to move"), the directional approach consistently reached higher best IoU for GPT-5.2. The per-edge structured output forced the model to reason about each boundary independently rather than making a holistic guess.

Separately, the color bias tests (6B, 7B, with GT drawn in blue instead of orange) revealed that swapping GT/prediction colors drops the initial accuracy significantly. In 6A (orange GT) the init IoU was 0.937, but in 6B (blue GT) it dropped to 0.850. This suggests GPT models have learned color-role priors: orange is "expected" as the ground truth color. However, the validation loop largely recovers this gap: after 5 iterations, 6A and 6B converge to similar best IoU (~0.96). The directional variants (7A, 7B) show the same pattern but converge faster.
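The directional feedback loop described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the benchmark's actual code: the post does not say how a per-edge judgment is turned into new coordinates, so the fixed pixel `step` is an assumption, and `judge` and `score` are hypothetical callables standing in for the model call and the IoU check.

```python
from typing import Callable, Dict, Tuple

# (left, top, right, bottom) in image pixels; y grows downward (assumption).
Box = Tuple[float, float, float, float]

# Signed offset per direction: "left"/"up" decrease a coordinate.
OFFSETS = {"left": -1, "right": 1, "up": -1, "down": 1, "none": 0}

# Vertical edges may only move horizontally, horizontal edges vertically.
VALID = {
    "left": {"left", "right", "none"},
    "right": {"left", "right", "none"},
    "top": {"up", "down", "none"},
    "bottom": {"up", "down", "none"},
}


def apply_edge_moves(box: Box, moves: Dict[str, str], step: float) -> Box:
    """Shift each edge independently by `step` pixels in the judged direction."""
    edges = dict(zip(("left", "top", "right", "bottom"), box))
    for edge, direction in moves.items():
        if direction not in VALID.get(edge, set()):
            raise ValueError(f"invalid move {direction!r} for edge {edge!r}")
        edges[edge] += OFFSETS[direction] * step
    return (edges["left"], edges["top"], edges["right"], edges["bottom"])


def refine(
    initial: Box,
    judge: Callable[[Box], Dict[str, str]],  # hypothetical: wraps the model call
    score: Callable[[Box], float],           # hypothetical: e.g. IoU against GT
    max_iters: int = 5,
    step: float = 8.0,                        # assumed fixed step size
    target: float = 0.99,
) -> Tuple[Box, float]:
    """Iterate per-edge corrections; stop early once the target score is met."""
    box = initial
    best_box, best_score = initial, score(initial)
    for _ in range(max_iters):
        if best_score >= target:
            break  # early stop: nothing left to correct
        box = apply_edge_moves(box, judge(box), step)
        current = score(box)
        if current > best_score:
            best_box, best_score = box, current
    return best_box, best_score
```

With a prediction that already scores at or above `target`, `refine` returns immediately without calling `judge`, which mirrors the early-stop behavior reported for GPT-5.4. Tracking the best box separately guards against an iteration that overshoots and lowers the score.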
Left: initial accuracy drops when GT is drawn in blue. Right: after the validation loop, the gap closes. Directional feedback (7A/7B) shows the same pattern.

For GPT-5.4: Color bias has no measurable effect. All variants (6A/6B/7A/7B) hit 0.994–0.998 IoU on iteration 1 regardless of color assignment.

Summary: What Changed from GPT-5.2 to GPT-5.4

The story of this benchmark is really about engineering workarounds that became unnecessary. Here's what we built for GPT-5.2 and whether you still need it:

Grid overlays & image dimensions in prompt: Gave +0.05–0.15 IoU for GPT-5.2. Not needed for GPT-5.4 (already ≥0.99 without it).
Extended reasoning mode: Gave +0.04–0.08 IoU for GPT-5.2. No benefit for GPT-5.4 on this task (already at ceiling without it).
Validation loops (iterative self-correction): Improved GPT-5.2 by +0.02–0.10 IoU over 5 iterations. Unnecessary for GPT-5.4 (early-stops at iteration 1).
Multiple runs & voting: Required for GPT-5.2 due to its ±0.08 run-to-run standard deviation. Not needed for GPT-5.4 (±0.003, single call sufficient).
Color convention management: GPT-5.2 showed color bias (−0.09 IoU when colors swapped). No effect on GPT-5.4.

GPT-5.4 doesn't just perform better; it makes entire categories of pipeline engineering unnecessary.

For clean, CAD-style images like the ones tested here, GPT-5.4 dramatically reduces prompt engineering overhead: grid overlays, image dimension injection, reasoning mode, and validation loops, all of which required deliberate effort with GPT-5.2, are no longer necessary. This translates directly to simpler pipelines, lower latency, and lower cost. That said, for more complex scenarios (multiple overlapping panels, cluttered backgrounds, or ambiguous region boundaries), iterative validation loops could still prove valuable, and we plan to explore this in future work.
This benchmark started as a sanity check and turned into a clear signal: GPT-5.4 represents a genuine leap in spatial coordinate understanding, not just a marginal iteration. The gap between 0.765 and 0.997 IoU on an identical task is the difference between a prototype experiment and a production-ready component.

Try It Yourself

Ready to explore GPT-5.4's spatial precision capabilities? Here are ways to get started:

Sample notebooks for the bounding box extraction test: github
Read the companion post: Extracting BOMs from Electrical Drawings with AI: Azure OpenAI GPT-5 + Azure Document Intelligence. See how this benchmark informed a production pipeline.

647 Views 4 likes 0 Comments

Teach ChatGPT to Answer Questions: Using Azure AI Search & Azure OpenAI (Semantic Kernel)
In this two-part series, we will explore how to build an intelligent service using Azure. In Series 1, we'll use Azure AI Search to extract keywords from unstructured data stored in Azure Blob Storage. In Series 2, we'll create a feature to answer questions based on PDF documents using Azure OpenAI.

26K Views 4 likes 3 Comments

Teach ChatGPT to Answer Questions: Using Azure AI Search & Azure OpenAI (Lang Chain)
In this two-part series, we will explore how to build an intelligent service using Azure. In Series 1, we'll use Azure AI Search to extract keywords from unstructured data stored in Azure Blob Storage. In Series 2, we'll create a feature to answer questions based on PDF documents using Azure OpenAI.

44K Views 4 likes 9 Comments

Azure OpenAI GPT model to review Pull Requests for Azure DevOps
In recent months, the use of Generative Pre-trained Transformer (GPT) models for natural language processing (NLP) has gained significant traction. GPT models, which are based on the Transformer architecture, can generate text from arbitrary sources of input data and can be trained to identify errors and detect anomalies in text. As such, GPT models are increasingly being used for a variety of applications, ranging from natural language understanding to text summarization and question-answering. In the software development world, developers use pull requests to submit proposed changes to a codebase. However, reviews by other developers can take a long time and are not always accurate, and in some cases these reviews can introduce new bugs and issues. To reduce this risk, I found during my research that GPT models can be integrated into the process: we can add Azure OpenAI Service as a pull request reviewer in the Azure Pipelines service. The GPT models are trained on developer codebases and are able to detect potential coding issues such as typos, syntax errors, style inconsistencies and code smells. In addition, they can also assess code structure and suggest improvements to the overall code quality. Once the GPT models have been trained, they can be integrated into the Azure Pipelines service so that they can automatically review pull requests and provide feedback. This helps to reduce the time taken for code reviews, as well as the likelihood of introducing bugs and issues.

48K Views 4 likes 13 Comments

Unlocking the Power of Open AI – Azure DevOps Backlogs from Images/PDFs
In today's digital world, the need to convert images and PDFs to text is becoming increasingly important. However, manually transcribing images and PDFs can be time-consuming and error-prone. Fortunately, there is a better way. With the Azure OpenAI service, you can easily and quickly convert images and PDFs to text.

4.9K Views 4 likes 0 Comments