
Microsoft Foundry Blog

Extracting BOMs from Electrical Drawings with AI: Azure OpenAI GPT-5.4 + Azure Document Intelligence

jihyeseo
Microsoft
Apr 22, 2026

What we learned building a multi-stage pipeline that combines vision LLMs, OCR, and iterative self-correction to extract Bills of Materials from electrical single-line diagrams.

The Problem

Electrical engineering drawings — especially single-line diagrams (SLDs) — are notoriously hard to parse programmatically. They combine dense symbols, small text, and complex geometric structures, all varying in style across documents and vendors.

Figure 1. A typical electrical single-line diagram. A single page contains multiple panel regions (HV, TR, LV, GCP), each with dozens of electrical components, connection lines, and specification text — all of which must be correctly identified and assigned to the right panel for BOM extraction.

Traditionally, extracting a Bill of Materials (BOM) from these drawings has been a manual task — engineers would read through each diagram page by page and transcribe component lists by hand. It's time-consuming, error-prone, and doesn't scale.

The obvious question: can a Vision Language Model automate this? We had to rely primarily on the visual content of PDF-converted images alone — without guaranteed access to CAD vector data or metadata. That constraint shaped every technique described in this post.

Warning: naive inference fails. Feeding a full diagram page to a vision model and asking for a BOM fails catastrophically — too much visual complexity in a single context. Components get missed, hallucinated, or assigned to the wrong panel.

This post shares the practical techniques we discovered while building a pipeline to solve this problem. The methods are general — applicable to any task that requires extracting structured information from visually complex technical documents.

Pipeline Architecture: Divide and Conquer

The core insight: the unit of analysis must be the panel, not the page.

A single diagram page can contain dozens of panels, each representing a distinct electrical cabinet with its own components. Asking a model to extract a complete BOM from an entire page is asking it to simultaneously locate, read, and count every component across all panels — a task that proved too complex even for GPT-5.4.

The solution was to first identify and crop each panel as an independent region, then extract the BOM panel by panel. This transformed an intractable whole-page problem into a series of manageable, well-scoped sub-tasks.
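The panel-by-panel driver can be sketched in a few lines (a sketch only; `crop_fn` and `extract_fn` stand in for the cropping and vision-extraction stages described later, and the names are hypothetical):

```python
def extract_bom_for_page(panel_bboxes, crop_fn, extract_fn):
    """Divide-and-conquer driver: crop each panel region and run BOM
    extraction on the crop, rather than on the whole page at once.

    panel_bboxes: {panel_name: (x1, y1, x2, y2)}
    crop_fn(bbox) -> panel image; extract_fn(image) -> list of BOM rows
    """
    bom = {}
    for panel_name, bbox in panel_bboxes.items():
        panel_img = crop_fn(bbox)          # scoped sub-task: one panel
        bom[panel_name] = extract_fn(panel_img)
    return bom
```

Each call to `extract_fn` now sees a single, well-scoped panel instead of the full page, which is the whole point of the decomposition.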

 

Figure 2. Five-stage pipeline architecture. Each stage is color-coded by its primary tool: rule-based (gray), Azure Document Intelligence (orange), hybrid Azure OpenAI+Document Intelligence (purple), and Azure OpenAI GPT-5 vision (blue). Tool tags on the right indicate specific components used at each stage.

 


Technique 1: Azure Document Intelligence-First Detection with Targeted Azure OpenAI Supplement

The first challenge is identifying all figure regions on each page — the panel diagrams that contain the actual electrical components. SLD pages typically mix diagrams with title blocks, revision tables, and legend boxes (often along the right edge or bottom). All of these must be located before we can isolate the panels.

The key design decision: use Azure Document Intelligence (DI) as the primary detector — and reserve GPT-5.4 only for gaps that DI misses. DI's prebuilt-layout model is fast, deterministic, and cheap compared to an Azure OpenAI vision call. By maximizing DI coverage first, we minimize the number of expensive Azure OpenAI invocations needed to achieve full detection.

Two-Pass Document Intelligence Layout Detection: Catching Occluded Regions

A single Azure Document Intelligence (DI) pass often misses figures that are visually occluded by larger detected regions — particularly smaller panels nested within or adjacent to large ones, and tables along the page edges. The solution: white-fill detected regions and re-run DI to reveal what was hidden underneath.

# Pass 1: Detect figures on original page
pass1 = analyze_page(di_client, "prebuilt-layout", image_path)
pass1_bboxes = [f["bbox"] for f in pass1["figures"]]

# Pass 2: White-fill Pass 1 regions → re-detect
if pass1_bboxes:
    white_fill_regions(image_path, pass1_bboxes, whitefill_path)
    pass2 = analyze_page(di_client, "prebuilt-layout", whitefill_path)

    # Merge & deduplicate by IoU
    all_figures = pass1["figures"] + pass2["figures"]
    deduped = _dedup_figure_bboxes(all_figures, iou_threshold=0.5)

This two-pass approach is especially effective at catching tables and annotation blocks along the right edge or bottom of the page that DI initially subsumes into a single large region.

Azure OpenAI GPT-5.4 Only for the Remaining Gaps

After two Azure Document Intelligence (DI) passes, most figure regions are covered. For any remaining gaps, Azure OpenAI GPT-5.4 is called once with the DI-detected regions marked in purple on the image. The model only needs to identify unmarked areas — a much simpler task than full-page detection from scratch.

Key finding: DI detection is ~10× faster and significantly cheaper per call than an Azure OpenAI GPT-5.4 vision request. By using DI as the primary detector and Azure OpenAI only for supplemental gap-filling, the pipeline achieves comprehensive coverage while keeping cost and latency low. The two-pass technique further reduces Azure OpenAI's burden by maximizing what DI can find on its own.


Technique 2: Hybrid Azure OpenAI GPT + Document Intelligence for Text Localization

To segment panel areas, we first need to know where panel names appear in the image. Panel names act as anchor points — once we know their locations, we can use them as seeds to identify the boundaries of each panel region.

Neither GPT-5.4 nor Azure Document Intelligence alone is sufficient:

  • GPT-5.4 : identifies what the text says, but imprecise on exact pixel locations
  • Azure Document Intelligence: precise coordinates for all text, but struggles with domain-specific abbreviations

The solution: run both in parallel and cross-match results.


Figure 3. Two parallel tracks converge into cross-matching. Top: page is split into overlapping tiles → Azure OpenAI GPT-5.4 extracts panel name candidates per tile → aggregated and deduplicated. Bottom: Azure DI extracts all text with bounding boxes → rule + Azure OpenAI GPT-5.4 filters by type → cross-matching prioritizes rule-based alignment, with Azure OpenAI GPT-5.4 resolving unmatched cases.

Tile-Based Name Extraction

Rather than feeding the entire page to Azure OpenAI, we split it into overlapping vertical tiles (2000px wide, 400px overlap) and extract panel name candidates from each tile independently. This reduces visual complexity per call and improves recall.

# Split page into overlapping tiles for LLM name extraction
tiles = tile_page(image_path, tile_width=2000, overlap=400)

# Extract names from each tile independently
for tile_img, tile_coords in tiles:
    names = extract_names_from_tile(tile_img, llm_client, deployment)
    # LLM identifies SHORT CODES: "HV 1", "TR-2", "LV 3A"
    # Rejects: component labels, wire tags, terminal IDs
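The tile geometry itself is simple arithmetic. A sketch of the span computation (`tile_page` would crop the page image at these horizontal spans; this helper is hypothetical):

```python
def compute_tile_spans(page_width, tile_width=2000, overlap=400):
    """Return (x0, x1) pixel spans for overlapping vertical tiles.
    Consecutive tiles overlap so a name cut by one tile edge appears
    whole in at least one neighboring tile."""
    step = tile_width - overlap
    spans = []
    x0 = 0
    while True:
        x1 = min(x0 + tile_width, page_width)
        spans.append((x0, x1))
        if x1 >= page_width:
            break
        x0 += step
    return spans
```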

Figure 4. Effect of tile-based extraction vs. full-page extraction. Overlapping tiles reduce visual complexity per LLM call, improving panel name recall — especially for names near page edges.

Hallucination Guard

GPT-5.4 sometimes fabricates panel names that don't exist in the image. We cross-validate all model-extracted names against the Azure Document Intelligence OCR text pool using fuzzy matching:

def hallucination_guard(names, ocr_texts):
    verified = []
    for name in names:
        if _name_in_ocr(name, ocr_texts):   # 3 matching modes:
            verified.append(name)           #   1. Exact substring
            # 2. Same length, ≤1 char diff
            # 3. Alphanumeric stripping (ignore spaces/punctuation)
        else:
            print(f"  Dropped '{name}' — no OCR match")
    return verified

6-Pass Rule Matching Engine

Once names are verified, we locate their exact pixel positions via a cascading rule engine that matches panel names against Azure Document Intelligence OCR bounding boxes with decreasing confidence:

def rule_match_panel_name(panel_name, di_lines, max_merge=3):
    # Pass 1: Exact match (case-insensitive)          → confidence 1.0
    # Pass 2: Alphanumeric (O/0, I/1 tolerance)       → confidence 0.9
    # Pass 3: Substring containment (≥3 chars)        → confidence 0.75
    # Pass 4: Multi-line merge (adjacent DI lines)    → confidence 0.7
    # Pass 5: OCR-confusable (1-char diff, v↔y, s↔5)  → confidence 0.6
    # Pass 6: LLM fallback for still-unmatched names

    # Conflict resolution: shared bbox → keep highest confidence
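A runnable sketch of the first two passes (helper names hypothetical; `di_lines` assumed to be a list of `{"text", "bbox"}` dicts from Document Intelligence):

```python
def _canon(s):
    """Canonical form for OCR-tolerant comparison: uppercase
    alphanumerics with common confusions folded (O→0, I/L→1)."""
    s = "".join(ch for ch in s.upper() if ch.isalnum())
    return s.translate(str.maketrans({"O": "0", "I": "1", "L": "1"}))

def rule_match_passes_1_2(panel_name, di_lines):
    """First two cascade passes: exact match (confidence 1.0), then
    OCR-confusable alphanumeric match (confidence 0.9)."""
    for line in di_lines:  # Pass 1: exact, case-insensitive
        if line["text"].strip().lower() == panel_name.strip().lower():
            return line["bbox"], 1.0
    target = _canon(panel_name)
    for line in di_lines:  # Pass 2: alphanumeric with O/0, I/1 folding
        if target and _canon(line["text"]) == target:
            return line["bbox"], 0.9
    return None, 0.0
```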

Key finding: GPT-5.4 identifies what the panel names are (semantic), while Azure Document Intelligence provides where they are (geometric). The rule engine bridges the two with OCR-aware fuzzy matching. This hybrid approach is significantly more robust than either system alone.


Technique 3: Iterative Locate → Verify with Oscillation Guard

With panel names located, the next challenge is identifying the full panel boundary. This is the hardest stage: panels can be irregularly shaped, share edges with neighbors, and have boundaries formed by a mix of dashed, solid, and implied lines.

Rather than asking GPT-5.4 to find the boundary in one shot (which is unreliable), we implemented an iterative Locate → Verify correction loop with up to 10 attempts per panel.

Visual Prompt Composition

Each iteration constructs a carefully composed image for Azure OpenAI GPT-5.4, providing spatial context through color-coded overlays:

Locate Input

  • Blue box — target panel name (NAME:{panel_name})
  • Green boxes — other panels as spatial reference

Verify Input

  • Orange box — proposed panel bbox
  • Blue box — panel name location
  • Green boxes — neighboring panels

The verify step analyzes each of the four edges independently:

// Verify response — per-edge analysis
{
  "valid": false,
  "edges": {
    "x1": { "status": "expand", "delta": -45, "corrected": 120 },
    "y1": { "status": "ok" },
    "x2": { "status": "shrink", "delta": -30, "corrected": 850 },
    "y2": { "status": "expand", "delta": 60, "corrected": 1200 }
  },
  "corrected_bbox": [120, 200, 850, 1200]
}

Oscillation Detection

A critical failure mode: Azure OpenAI GPT-5.4 oscillates on an axis — expanding, then shrinking, then expanding — never converging. We detect this using a 3-value history per axis:

# Track last 3 corrections per axis
axis_history = {axis: [] for axis in ["x1", "y1", "x2", "y2"]}

for attempt in range(1, max_tries + 1):
    # ... locate and verify ...
    for axis in AXES:
        hist = axis_history[axis]
        hist.append(corrected[axis])  # Record this attempt's value
        if len(hist) >= 3:
            a, b, c = hist[-3:]
            # Detect: (a < b > c) or (a > b < c) → oscillating
            if (a < b and b > c) or (a > b and b < c):
                corrected[axis] = hist[-2]  # Freeze at previous value

The loop processes all panels on a page in batch mode — a single locate-all call positions all panels simultaneously, reducing per-panel LLM calls from N to 1.
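The oscillation test itself is a three-point direction check, easy to isolate and unit-test (names hypothetical):

```python
def is_oscillating(history):
    """True if the last three corrections on an axis reversed
    direction (up-then-down or down-then-up), i.e. the verify loop
    is bouncing rather than converging."""
    if len(history) < 3:
        return False
    a, b, c = history[-3:]
    return (a < b > c) or (a > b < c)

def stabilize(history):
    """Value to freeze at when oscillation is detected, else the
    latest proposal."""
    return history[-2] if is_oscillating(history) else history[-1]
```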


Technique 4: Few-Shot Visual Prompting

Text prompts alone struggle to convey spatial concepts like "what a panel boundary looks like." The visual vocabulary of electrical drawings is too domain-specific to describe in words. The solution: provide GPT-5.4 with few-shot reference images directly in the prompt.

def locate_and_verify_batch(..., guide_image_paths=None):
    guides = list(guide_image_paths or [])
    # Prepend guide images before the actual input:
    loc_raw = call_llm(
        llm_client, deployment, locate_prompt,
        image_paths=guides + [locate_img_path],  # Guides first
    )

Figure 5. Few-shot reference for panel boundary recognition. Example 1: dashed rectangle enclosure. Example 2: mixed boundary with dashed lines, solid edge, and gap + vertical line as shared boundary. Example 3: non-rectangular region returns the full outer bounding box.

The benefits:

  • Reduces prompt length — no need to describe visual concepts in words
  • Improves consistency — the model interprets boundary types correctly across varying layouts
  • User-customizable — swapping in new guide images adapts to new drawing styles without code changes

Technique 5: Reasoning Mode — Performance vs. Cost

Azure OpenAI GPT-5.4's reasoning capability (reasoning={"effort": "low|medium|high"}) significantly affects both accuracy and latency. We ran systematic experiments across all four settings (low, medium, high, and no reasoning) in two key pipeline stages: image area detection and BOM extraction.

Reasoning in Image Area Detection

Each reasoning level was tested 3 times on the region detection stage (Azure OpenAI GPT-5.4 detects and verifies figure boundaries).

Detection Quality


Figure 6. Region detection output across reasoning levels (High, Medium, Low, None × 3 runs). All levels produced comparable quality. high converged in fewer iterations (2), while medium/low sometimes needed 3.

Verification Iterations

Figure 7. Mean verification iterations per page (3 runs avg). high: 2.0, medium: 2.7, low: 3.0, none: 2.0. Lower reasoning needs slightly more iterations but converges to the same quality.

Processing Time

Figure 8. Mean LLM time for detection + verification (3 runs avg). high/medium ~170s, low ~52s, none ~15s.

Key finding: All reasoning levels (low, medium, high) produced similar quality and noticeably better results than none, but with increasing latency (~3× from low to high). Since there was no meaningful quality difference among reasoning levels, we chose low as the optimal setting — getting the benefit of reasoning at minimal latency cost.

Reasoning in BOM Extraction

For BOM extraction — reading component lists from cropped panel images — reasoning has a more pronounced accuracy impact:


Figure 9. Time vs. accuracy across 3 runs per setting. Low reasoning reaches ~85–86% accuracy in ~2,200–2,400s; medium reaches ~89–91% in ~3,700–4,100s; high timed out.

  • Low: ~85–86% accuracy, ~2,200–2,400s
  • Medium: ~89–91% accuracy, ~3,700–4,100s
  • High: timeout
 

Recommended Configurations

Pipeline Stage           Recommended   Rationale
Image region detection   low           Same quality at ~3× less cost
Region verification      low           Sufficient with rich visual context (color-coded overlays)
BOM extraction           medium        +3–5% accuracy; high causes timeouts

Key insight: Different stages need different reasoning levels. Use low for spatially-grounded tasks with rich visual context, medium for semantically-demanding reading tasks. Optimize inputs before scaling compute.
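These recommendations are easy to encode as a small per-stage configuration (stage keys hypothetical; the reasoning={"effort": ...} shape matches the parameter shown earlier):

```python
# Per-stage reasoning effort, per the experiments above
REASONING_EFFORT = {
    "region_detection": "low",     # same quality at ~3x less cost
    "region_verification": "low",  # rich visual context suffices
    "bom_extraction": "medium",    # +3-5% accuracy; "high" times out
}

def reasoning_params(stage):
    """Build the reasoning kwargs for an Azure OpenAI call, defaulting
    to low effort for unknown stages."""
    return {"reasoning": {"effort": REASONING_EFFORT.get(stage, "low")}}
```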


Technique 6: SVG Vector Boundary Detection

While the main pipeline relies on Azure OpenAI GPT-5.4's vision for panel boundary detection, we also explored a purely geometric approach that bypasses the GPT-5.4 model entirely — useful when CAD files can be exported as SVG vectors rather than raster images.

The core idea: panel boundaries are closed geometric shapes formed by line segments in the vector data. If we can extract meaningful line clusters and find closed cycles in the resulting graph, we can identify panel regions without any LLM calls.

Figure 10. Panel regions detected via SVG vector analysis — each color represents a distinct panel boundary identified through line clustering and cycle extraction, with no LLM involvement.

 

The Problem: Too Many Lines

A single SVG page can contain 17,000+ line segments — horizontal, vertical, and diagonal — mixing panel borders, component symbols, text strokes, and wiring lines. Attempting to work with raw segments directly is impractical.

 

Figure 11. Raw SVG line extraction from a single page: 15,742 horizontal + 1,106 vertical + 227 diagonal segments. The length distribution (right) shows most segments are very short (symbol strokes, text) — boundary lines are a tiny fraction of the total.

Chain-Based Line Clustering

The solution is clustering collinear, nearby segments into coherent boundary lines. We first scan fragments in order of position. If the next fragment is close enough, extend the current group. If the gap is too large, start a new one. Discard any group that is too short or too sparse to be a real boundary. 

For example, using a DashDotBoundaryStyle strategy with 25px tolerance, 17,000+ raw segments collapse into just 15 horizontal + 18 vertical clusters — each representing a candidate panel edge.

Figure 12. After clustering: 15 horizontal lines (left) and 18 vertical lines (right), each color-coded by cluster. Legend shows pixel position and segment count per cluster. The 3,604 unassigned segments (gray) are symbol strokes and other non-boundary elements, filtered out.

# Group collinear, adjacent line segments into chains
# Lines sharing an endpoint (within tolerance) and same direction → one chain
chains = cluster_lines_to_chains(svg_lines, endpoint_tolerance=2.0)

# Filter by length — short chains are symbols, long chains are boundaries
boundary_chains = [c for c in chains if c.total_length > min_boundary_length]
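The gap-based scan described above reduces to one dimension per axis; a sketch (gap and count thresholds illustrative, helper hypothetical):

```python
def cluster_fragments(positions, max_gap=25.0, min_count=3):
    """Group sorted 1-D fragment positions into chains: extend the
    current chain while the gap to the next fragment is small enough,
    otherwise start a new one; drop chains too sparse to be a real
    boundary line."""
    chains, current = [], []
    for p in sorted(positions):
        if current and p - current[-1] > max_gap:
            if len(current) >= min_count:
                chains.append(current)
            current = []
        current.append(p)
    if len(current) >= min_count:
        chains.append(current)
    return chains
```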

Density-Peak Clustering for Boundary Lines

Dotted lines have gaps too large to bridge by proximity. Instead, project all fragments onto a ruler and count how many land at each position. Wherever fragments pile up — a spike — that's a boundary line. Find the spikes, ignore the noise.

# Find density peaks in X-coordinates of vertical lines
# and Y-coordinates of horizontal lines
x_peaks = density_peak_cluster(vertical_lines, axis='x')
y_peaks = density_peak_cluster(horizontal_lines, axis='y')

# Each peak represents a candidate panel edge position
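A minimal density-peak sketch using coordinate histogram voting (bin size and vote threshold illustrative):

```python
from collections import Counter

def density_peak_cluster(coords, bin_size=5, min_votes=10):
    """Quantize coordinates into bins and return the center of every
    bin whose vote count spikes above the noise floor — each spike is
    a candidate boundary-line position."""
    votes = Counter(int(c // bin_size) for c in coords)
    return sorted(
        (b + 0.5) * bin_size for b, n in votes.items() if n >= min_votes
    )
```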

BoundaryStyleStrategy Abstraction

Different drawings use different boundary conventions — solid lines, dashed lines, mixed styles. A strategy pattern allows the detection algorithm to adapt:

# Pluggable strategy for what constitutes a "boundary line"
class BoundaryStyleStrategy:
    def is_boundary_line(self, line) -> bool: ...
    def merge_candidates(self, lines) -> List[Chain]: ...

# Implementations:
# - SolidLineStrategy: continuous lines above length threshold
# - DashedLineStrategy: periodic short segments with gaps
# - MixedStrategy: combines both heuristics

Seed-Guided Minimal Cycle Extraction

The final step finds the actual closed regions. Using panel name locations as seeds (from Technique 2), we search for the minimal cycle in the planar graph that encloses each seed point:

# Build planar graph from boundary line segments
G = build_planar_graph(boundary_chains)

for panel_name, seed_point in panel_seeds:
    # Find minimal cycle enclosing the seed point
    cycle = find_minimal_enclosing_cycle(G, seed_point)
    if cycle:
        panel_regions[panel_name] = cycle_to_polygon(cycle)

Trade-off: This approach is faster and more precise than LLM-based detection, but requires SVG vector access (not always available) and is sensitive to non-standard drawing conventions. In our pipeline, it serves as an alternative path when vector data is accessible.


Results & Error Analysis

The pipeline achieved 94.21% accuracy (277/294 materials correctly extracted) across 53 panels on 4 diagram pages. Processing time was ~62 minutes from pre-cropped panel images.

Metric                  Value
Overall accuracy        94.21%
Panels processed        53
Materials extracted     346
Correctly identified    277 / 294
Processing time         ~62 min (panel images only)

A +5.43% accuracy improvement (88.78% → 94.21%) came from iterative prompt refinement based on domain expert review of extraction errors — identifying recurring patterns and translating them into prompt corrections.

 

Figure 13. Iterative prompt refinement: GPT-5.4 results → expert review → error pattern analysis → prompt corrections → re-run.

Key Takeaways

Decompose aggressively. Stage-wise processing with scoped inputs is the difference between working and not working. Break the problem down until each sub-task is tractable for the model.

Visual context beats reasoning effort. Color-coded overlays and few-shot images reduce reliance on expensive reasoning modes. Optimize inputs before scaling compute.

Verify, but stop early. Self-correction loops improve accuracy — but accumulated context can confuse the model. Oscillation detection and early stopping are critical.

Hybrid always wins. Azure Document Intelligence for precise coordinates, GPT-5.4 for semantics. Pure LLM solutions lose on precision-critical tasks.


What's Next?

The techniques here generalize well beyond SLDs. We're exploring several directions:

  • Other drawing types — P&ID, mechanical assembly, architectural floor plans. The core pipeline stages stay the same; only the few-shot guides and panel name patterns change per domain.
  • ERP/PLM integration — Feeding extracted BOMs directly into SAP, Oracle, or PTC Windchill to close the loop from drawing to purchase order.
  • Active learning from HITL corrections — Using human corrections captured in the Streamlit demo as training signal to drive automatic prompt refinement.
  • Cost optimization at scale — Batching Azure OpenAI calls, caching DI results for recurring templates, and leveraging SVG vector detection (Technique 6) whenever CAD exports are available.
  • Multi-modal verification — Cross-referencing extracted BOMs against parts databases or previous drawing revisions to validate extraction accuracy in context.

Get Started

Run the Demo: clone the GitHub repository and launch the Streamlit HITL demo with your own SLD drawings.

 

Figure 14. The Streamlit HITL demo showing Step 5 — BOM Extraction results. Each panel's cropped image is displayed alongside the extracted component table with device symbol, name, specification, quantity, and confidence level.

View on GitHub →

 

Explore the Services

Join the Conversation

Have questions or built something similar? Share your experience in the comments below or connect with us on the Azure AI Tech Community.


Updated Apr 20, 2026
Version 1.0