A systematic benchmark of bounding box accuracy across GPT-5.2 and GPT-5.4 — born from a real-world need to extract panel layouts from electrical SLD drawings.
Why I Ran This Experiment
This work started not as a benchmarking exercise, but as a practical problem: I needed to automatically extract panel regions from PDF-format electrical Single-Line Diagram (SLD) drawings using OpenAI models. All experiments were conducted in Microsoft Foundry, Microsoft's unified platform for building generative AI applications.
The downstream goal was a pipeline that combines a GPT model with Azure Document Intelligence to generate Bills of Materials (BOMs) — a project I wrote about separately in Extracting BOMs from Electrical Drawings with AI: Azure OpenAI GPT-5 + Azure Document Intelligence Pipeline.
Before building that pipeline, I needed a clear-eyed answer to a deceptively simple question: how well can GPT actually understand and return pixel-level coordinates from an image? If the model can't reliably locate a panel bounding box, the rest of the pipeline doesn't matter.
When I first ran these tests against GPT-5.2, the results were mixed — good enough to be promising, but inconsistent enough to leave clear room for improvement. I tried many workarounds: feeding image dimensions explicitly, overlaying coordinate grids, enabling extended reasoning, and building iterative self-correction loops. Each helped, but required deliberate engineering effort. Then GPT-5.4 was released. Re-running the same benchmark revealed that most of those workarounds were no longer necessary.
Context: All experiments use a fixed CAD-style test image (847 × 783 px) with a known ground-truth bounding box at [135, 165, 687, 619]. Accuracy is measured by Intersection over Union (IoU) — a score of 1.0 is a perfect match. Every test was run 5 times and averaged for all coordinate experiments.
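For reference, the IoU metric used throughout can be computed directly from two corner-format boxes. A minimal sketch, assuming the `[x1, y1, x2, y2]` box format used in these tests:

```python
def iou(box_a, box_b):
    """Intersection over Union for two [x1, y1, x2, y2] boxes."""
    # Corners of the intersection rectangle
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

gt = [135, 165, 687, 619]
print(iou(gt, gt))  # identical boxes → 1.0
print(round(iou(gt, [140, 170, 690, 620]), 3))  # a near-miss scores ~0.97
```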
The Experiment Design
I designed experiments across two axes: prompt strategy (how spatial information is presented to the model) and reasoning mode (standard vs. extended reasoning). Each model (GPT-5.2 and GPT-5.4) was tested under both reasoning modes (None vs. High), producing four conditions per test.
Single-Shot Strategies (Tests 1–5)
These tests have no iterative validation loop — the model gets one prompt and returns its answer. Each test was run 5 times and the results averaged, so the scores reflect consistency, not a single lucky attempt. The differences between tests lie in how spatial information is framed in the prompt.
Test 1 is a simple sanity check: can the model understand percentage-based coordinates at all? The model receives the clean image (no overlay) and is asked: "return the pixel coordinate at 30% width, 50% height." The expected answer is (254, 392).
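That expected answer is plain arithmetic on the known image dimensions:

```python
width, height = 847, 783  # the fixed test image

x = round(0.30 * width)   # 254.1 rounds to 254
y = round(0.50 * height)  # 391.5 rounds to 392

print((x, y))  # (254, 392)
```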
GPT-5.2 gets the X coordinate roughly right (~254–260), but the Y coordinate scatters wildly — predictions range from 260 to 322, consistently 70–130 pixels above the correct position. GPT-5.4 returns (254, 392) on every single run, essentially pixel-perfect.
Even on this simple sanity check, the gap is stark: GPT-5.4 is pixel-perfect from the start, while GPT-5.2 shows a clear Y-axis bias. But a single-point test doesn't tell us how well the models handle real spatial tasks. The next question: can they detect a full bounding box?
Tests 2–5: Bounding Box Detection with Increasing Prompt Richness
Tests 2–5 move to the real task: detecting a bounding box drawn on the image. Each test sends a different version of the same base image, with progressively richer spatial context in the prompt:
Table 1 — Single-shot test descriptions: prompt strategy and input type for Tests 1–5.
Figure 3 — Input images for Tests 2–5, from simplest (orange bbox only) to most structured (numbered grid).
Feedback Loop Strategies (Tests 6A–7B)
These tests add an iterative validation loop: the model's predicted bounding box is overlaid on the image and sent back for self-correction — up to 5 iterations (early stop at IoU ≥ 0.99). All feedback tests share the same two-phase structure: an init step (first prediction) and a validation loop (iterative correction).
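The loop's control flow can be sketched as below. `predict_bbox` and `overlay_prediction` are hypothetical stand-ins for the model call and the overlay-rendering step; the early-stop threshold and iteration cap match the benchmark settings:

```python
MAX_ITERS = 5
EARLY_STOP_IOU = 0.99

def feedback_loop(image, ground_truth, predict_bbox, overlay_prediction, iou):
    """Init prediction, then up to MAX_ITERS self-correction rounds.

    predict_bbox(image) -> [x1, y1, x2, y2] and
    overlay_prediction(image, box) -> annotated image
    are placeholders for the real model/rendering calls.
    """
    box = predict_bbox(image)  # init step: first prediction
    best_box, best_iou = box, iou(box, ground_truth)
    for _ in range(MAX_ITERS):
        if best_iou >= EARLY_STOP_IOU:  # early stop once near-perfect
            break
        annotated = overlay_prediction(image, box)  # draw current guess
        box = predict_bbox(annotated)               # ask model to correct
        score = iou(box, ground_truth)
        if score > best_iou:
            best_box, best_iou = box, score
    return best_box, best_iou
```

In the benchmark the ground truth is known, so IoU doubles as the stopping criterion; a production pipeline would need a different stop signal (e.g. the model reporting "no change needed").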
All feedback tests use the same two images (init + validation overlay), but differ in prompt strategy and color assignment. Image-wise, they fall into two groups:
Group A — Orange GT (Tests 6A, 6C, 7A)
Figure 4a — Feedback loop input images. Orange GT box with blue prediction
Group B — Color Bias / Blue GT (Tests 6B, 6D, 7B)
Figure 4b — Feedback loop input images. Group B (bottom): colors swapped to test color-role priors.
What differs between tests in the same group: The images are identical, but the prompt changes. 6A/6B use holistic comparison ("compare and correct"). 6C/6D additionally send the full history of past predictions as multi-image input. 7A/7B ask for per-edge directional judgments ("move left/right/up/down/none" for each edge independently).
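Applying the per-edge verdict from 7A/7B to a box is mechanical. A sketch, assuming the model returns one direction per edge and using an illustrative fixed step size (the actual step logic in the benchmark may differ):

```python
def apply_edge_feedback(box, verdict, step=10):
    """Nudge each edge of a [x1, y1, x2, y2] box per a directional verdict.

    verdict maps each edge to 'left'/'right'/'up'/'down'/'none';
    the 10 px step is an illustrative choice, not the benchmark's.
    """
    x1, y1, x2, y2 = box
    dx = {"left": -step, "right": step, "none": 0}
    dy = {"up": -step, "down": step, "none": 0}
    return [
        x1 + dx[verdict["left_edge"]],
        y1 + dy[verdict["top_edge"]],
        x2 + dx[verdict["right_edge"]],
        y2 + dy[verdict["bottom_edge"]],
    ]

print(apply_edge_feedback(
    [140, 160, 680, 625],
    {"left_edge": "left", "top_edge": "down",
     "right_edge": "right", "bottom_edge": "up"},
))  # → [130, 170, 690, 615]
```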
Results
1. Model version is the single biggest factor
Across every test, GPT-5.4 dramatically outperforms GPT-5.2. The gap is not incremental — it's the difference between a bounding box that roughly overlaps the target and one that is essentially pixel-perfect. GPT-5.4 achieved an IoU of 0.99 or above on its very first attempt on tests where GPT-5.2 had only scored between 0.76 and 0.92.
Figure 5 — Single-shot IoU across all 4 conditions. GPT-5.4 (green bars) consistently hits ≥0.99 regardless of prompt strategy or reasoning mode.
GPT-5.2 (blue bars) ranges from 0.76 to 0.92.
2. GPT-5.2 is inconsistent; GPT-5.4 locks in
Raw averages only tell half the story. GPT-5.2 is unpredictable: on the exact same test with the exact same prompt and image, results fluctuate wildly between runs. The standard deviation on Test 2 is ±0.084 — meaning a single run could land anywhere from 0.66 to 0.88. GPT-5.4 stays within ±0.003.
The scatter plots below make this viscerally clear. Each dot is one API call — notice how GPT-5.2 dots spray across the IoU range while GPT-5.4 dots stack on top of each other:
Figure 6a — GPT-5.2: per-run IoU across Tests 2–5. Wide scatter on simpler prompts (Test 2: 0.66–0.88); reasoning mode (orange) provides a lift that shrinks with richer prompts (Δmean shown below each panel).
Figure 6b — GPT-5.4: same view — all dots cluster at 0.97–1.0. No meaningful variance, no reasoning benefit.
Production implication: With GPT-5.2, you couldn't rely on a single inference call — building a reliable pipeline would require multiple calls and majority voting, multiplying latency and cost. With GPT-5.4, a single call is sufficient.
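One simple voting scheme for a GPT-5.2-era pipeline is a coordinate-wise median across runs, which discards per-coordinate outliers. A sketch with illustrative (made-up) run data:

```python
from statistics import median

def aggregate_boxes(boxes):
    """Coordinate-wise median of several [x1, y1, x2, y2] predictions."""
    return [median(coords) for coords in zip(*boxes)]

# Illustrative predictions from three independent runs
runs = [
    [130, 150, 690, 610],
    [138, 168, 685, 622],
    [120, 200, 700, 590],  # a noisier run
]
print(aggregate_boxes(runs))  # → [130, 168, 690, 610]
```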
3. Reasoning mode reduced variance for GPT-5.2; GPT-5.4 didn't need it
For GPT-5.2, enabling extended reasoning (reasoning: high) provided a meaningful boost — especially when the prompt was sparse. On Test 2 (bare image, no spatial context), reasoning added +0.076 IoU and visibly tightened the spread of results across runs. As prompts got richer, the benefit shrank: with a grid overlay (Test 4), reasoning added only +0.007. In other words, reasoning mode acted as a compensating mechanism — filling in the gaps when the prompt alone didn't provide enough spatial scaffolding.
For GPT-5.4, reasoning mode offered no additional benefit on this class of task. The base model already achieves 0.99+ IoU, so there was simply no room for improvement. In a few cases the reasoning runs showed marginal regressions (−0.005 to −0.015), likely within noise. The takeaway isn't that reasoning mode is harmful in general, but rather that a spatial-coordinate task at this complexity level doesn't require it when the underlying model already has strong coordinate understanding.
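For concreteness, extended reasoning is a per-request setting. A sketch of the two request configurations compared here, assuming the OpenAI Responses API shape (`reasoning={"effort": ...}`); the model name and prompt are illustrative:

```python
# Base request shared by both conditions (model name illustrative)
base_request = {
    "model": "gpt-5.4",
    "input": "Return the bounding box of the orange rectangle as [x1, y1, x2, y2].",
}

# Condition 1: standard mode (no reasoning parameter)
standard = {**base_request}

# Condition 2: extended reasoning (reasoning: high)
extended = {**base_request, "reasoning": {"effort": "high"}}

print("reasoning" in standard, extended["reasoning"]["effort"])
```

Each dict would be passed as keyword arguments to the client's response-creation call; everything else about the two conditions is identical.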
Figure 7 — Effect of reasoning mode: GPT-5.2 gains +0.04–0.08 from reasoning (blue bars), largest on sparse prompts.
GPT-5.4 shows no meaningful gain (green bars near zero).
4. Richer prompts close the gap (but only for GPT-5.2)
For GPT-5.2, providing more spatial context in the prompt made a big difference: from 0.765 (Test 2, no info) to 0.910 (Test 4, grid overlay) — a +0.145 IoU gain just from adding visual reference rulers to the image. Telling the model the image dimensions (Test 3) was a "free win" that cost nothing.
For GPT-5.4, all prompt variants produce essentially the same result (0.989–0.997). The model already understands spatial coordinates well enough that extra scaffolding adds no value.
If you're still on GPT-5.2: Always inject image dimensions into the prompt (free). Use grid overlays for the biggest single-shot gain (+0.145 IoU). With GPT-5.4, none of this is needed.
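Injecting the dimensions is a one-line prompt change. A minimal sketch (the wording is illustrative, not the benchmark's exact prompt):

```python
def spatial_prompt(width, height):
    """Prepend explicit pixel dimensions so the model need not infer them."""
    return (
        f"The image is exactly {width}x{height} pixels "
        "(origin at the top-left). "
        "Return the panel bounding box as [x1, y1, x2, y2] in pixels."
    )

print(spatial_prompt(847, 783))
```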
5. Validation loops: essential for GPT-5.2, optional for GPT-5.4
The feedback loop tests (6A–7B) showed that iterative self-correction genuinely helped GPT-5.2 improve from its initial prediction. For example, in Test 7A (directional feedback), GPT-5.2 improved from an init IoU of 0.926 to a best of 0.969 over 5 iterations.
For GPT-5.4, every single run hit IoU ≥ 0.99 on iteration 1 and early-stopped immediately. There was nothing left to correct. The validation loop infrastructure — overlay rendering, multi-turn prompting, iteration logic — becomes dead code you can remove from your pipeline.
GPT-5.4 (green) starts at ≥0.99 and early-stops at iteration 1.
6. Prompt instruction matters: holistic vs directional feedback
Comparing 6A/6B (holistic: "compare the two boxes and correct") with 7A/7B (directional: "for each edge, decide which direction to move"), the directional approach consistently reached higher best IoU for GPT-5.2. The per-edge structured output forced the model to reason about each boundary independently rather than making a holistic guess.
Separately, the color bias tests (6B, 7B — GT drawn in blue instead of orange) revealed that swapping GT/prediction colors drops the initial accuracy significantly. In 6A (orange GT) the init IoU was 0.937, but in 6B (blue GT) it dropped to 0.850. This suggests GPT models have learned color-role priors — orange is "expected" as the ground truth color.
However, the validation loop largely recovers this gap: after 5 iterations, 6A and 6B converge to similar best IoU (~0.96). The directional variants (7A, 7B) show the same pattern but converge faster.
Left: initial accuracy drops when GT is drawn in blue.
Right: after the validation loop, the gap closes. Directional feedback (7A/7B) shows the same pattern.
For GPT-5.4: Color bias has no measurable effect. All variants (6A/6B/7A/7B) hit 0.994–0.998 IoU on iteration 1 regardless of color assignment.
Summary: What Changed from GPT-5.2 to GPT-5.4
The story of this benchmark is really about engineering workarounds that became unnecessary. Here's what I built for GPT-5.2 and whether you still need it:
- Grid overlays & image dimensions in prompt — Gave +0.05–0.15 IoU for GPT-5.2. Not needed for GPT-5.4 (already ≥0.99 without it).
- Extended reasoning mode — Gave +0.04–0.08 IoU for GPT-5.2. No benefit for GPT-5.4 on this task (already at ceiling without it).
- Validation loops (iterative self-correction) — Improved GPT-5.2 by +0.02–0.10 IoU over 5 iterations. Unnecessary for GPT-5.4 (early-stops at iteration 1).
- Multiple runs & voting — Required for GPT-5.2 due to ±0.08 variance. Not needed for GPT-5.4 (±0.003 variance, single call sufficient).
- Color convention management — GPT-5.2 showed color bias (−0.09 IoU when colors swapped). No effect on GPT-5.4.
This benchmark started as a sanity check and turned into a clear signal: GPT-5.4 represents a genuine leap in spatial coordinate understanding, not just a marginal iteration. The gap between 0.765 and 0.997 IoU on an identical task is the difference between a prototype experiment and a production-ready component.
Try It Yourself
Ready to explore GPT-5.4's spatial precision capabilities? Here are ways to get started:
- Sample notebooks for the bounding box extraction tests: GitHub
- Read the companion post: Extracting BOMs from Electrical Drawings with AI: Azure OpenAI GPT-5 + Azure Document Intelligence — see how this benchmark informed a production pipeline