Reference Architecture for Highly Available Multi-Region Azure Kubernetes Service (AKS)
Introduction

Cloud-native applications often support critical business functions and are expected to stay available even when parts of the platform fail. Azure Kubernetes Service (AKS) already provides strong availability features within a single region, such as availability zones and a managed control plane. However, a regional outage is still a scenario that architects must plan for when running important workloads.

This article walks through a reference architecture for running AKS across multiple Azure regions. The focus is on availability and resilience, using practical patterns that help applications continue to operate during regional failures. It covers common design choices such as traffic routing, data replication, and operational setup, and explains the trade-offs that come with each approach. This content is intended for cloud architects, platform engineers, and Site Reliability Engineers (SREs) who design and operate Kubernetes platforms on Azure and need to make informed decisions about multi-region deployments.

Resilience Requirements and Design Principles

Before designing a multi-region Kubernetes platform, it is essential to define resilience objectives aligned with business requirements:

- Recovery Time Objective (RTO): maximum acceptable downtime during a regional failure.
- Recovery Point Objective (RPO): maximum acceptable data loss.
- Service-Level Objectives (SLOs): availability targets for applications and platform services.

The architecture described in this article aligns with the Azure Well-Architected Framework Reliability pillar, emphasizing fault isolation, redundancy, and automated recovery.

Multi-Region AKS Architecture Overview

The reference architecture uses two independent AKS clusters deployed in separate Azure regions, such as West Europe and North Europe. Each region is treated as a separate deployment stamp, with its own networking, compute, and data resources. This regional isolation helps reduce blast radius and allows each environment to be operated and scaled independently.

Traffic is routed at a global level using Azure Front Door together with DNS. This setup provides a single public entry point for clients and enables traffic steering based on health checks, latency, or routing rules. If one region becomes unavailable, traffic can be automatically redirected to the healthy region. Each region exposes applications through a regional ingress layer, such as Azure Application Gateway for Containers or an NGINX Ingress Controller. This keeps traffic management close to the workload and allows region-specific configuration when needed. Data services are deployed with geo-replication enabled to support multi-region access and recovery scenarios. Centralized monitoring and security tooling provides visibility across regions and helps operators detect, troubleshoot, and respond to failures consistently.

The main building blocks of the architecture are:

- Azure Front Door as the global entry point
- Azure DNS for name resolution
- An AKS cluster deployed in each region
- A regional ingress layer (Application Gateway for Containers or NGINX Ingress)
- Geo-replicated data services
- Centralized monitoring and security services

Deployment Patterns for Multi-Region AKS

There is no single “best” way to run AKS across multiple regions. The right deployment pattern depends on availability requirements, recovery objectives, operational maturity, and cost constraints.
This section describes three common patterns used in multi-region AKS architectures and highlights the trade-offs associated with each one.

Active/Active Deployment Model

In an active/active deployment model, AKS clusters in multiple regions serve production traffic at the same time. Global traffic routing distributes requests across regions based on health checks, latency, or weighted rules. If one region becomes unavailable, traffic is automatically shifted to the remaining healthy region. This model provides the highest level of availability and the lowest recovery time, but it requires careful handling of data consistency, state management, and operational coordination across regions.

| Capability | Pros | Cons |
| --- | --- | --- |
| Availability | Very high availability with no single active region | Requires all regions to be production-ready at all times |
| Failover behavior | Near-zero downtime when a region fails | More complex to test and validate failover scenarios |
| Data consistency | Supports read/write traffic in multiple regions | Requires strong data replication and conflict handling |
| Operational complexity | Enables full regional redundancy | Higher operational overhead and coordination |
| Cost | Maximizes resource utilization | Highest cost due to duplicated active resources |

Active/Passive Deployment Model

In an active/passive deployment model, one region serves all production traffic, while a second region remains on standby. The passive region is kept in sync but does not receive user traffic until a failover occurs. When the primary region becomes unavailable, traffic is redirected to the secondary region. This model reduces operational complexity compared to active/active and is often easier to operate, but it comes with longer recovery times and underutilized resources.

| Capability | Pros | Cons |
| --- | --- | --- |
| Availability | Protects against regional outages | Downtime during failover is likely |
| Failover behavior | Simpler failover logic | Higher RTO compared to active/active |
| Data consistency | Easier to manage a single write region | Requires careful promotion of the passive region |
| Operational complexity | Easier to operate and test | Manual or semi-automated failover processes |
| Cost | Lower cost than active/active | Standby resources are mostly idle |

Deployment Stamps and Isolation

Deployment stamps are a design approach rather than a traffic pattern. Each region is deployed as a fully isolated unit, or stamp, with its own AKS cluster, networking, and supporting services. Stamps can be used with both active/active and active/passive models. The goal of deployment stamps is to limit blast radius, enable independent lifecycle management, and reduce the risk of cross-region dependencies.

| Capability | Pros | Cons |
| --- | --- | --- |
| Availability | Limits impact of regional or platform failures | Requires duplication of platform components |
| Failover behavior | Enables clean and predictable failover | Failover logic must be implemented at higher layers |
| Data consistency | Encourages clear data ownership boundaries | Data replication can be more complex |
| Operational complexity | Simplifies troubleshooting and isolation | More environments to manage |
| Cost | Supports targeted scaling per region | Increased cost due to duplicated infrastructure |

Global Traffic Routing and Failover

In a multi-region setup, global traffic routing is responsible for sending users to the right region and keeping the application reachable when a region becomes unavailable. In this architecture, Azure Front Door acts as the global entry point for all incoming traffic.
Azure Front Door provides a single public endpoint that uses Anycast routing to direct users to the closest available region. TLS termination and Web Application Firewall (WAF) capabilities are handled at the edge, reducing latency and protecting regional ingress components from unwanted traffic. Front Door also performs health checks against regional endpoints and automatically stops sending traffic to a region that is unhealthy.

DNS plays a supporting role in this design. Azure DNS or Traffic Manager can be used to define geo-based or priority-based routing policies and to control how traffic is initially directed to Front Door. Health probes continuously monitor regional endpoints, and routing decisions are updated when failures are detected. When a regional outage occurs, unhealthy endpoints are removed from rotation. Traffic is then routed to the remaining healthy region without requiring application changes or manual intervention. This allows the platform to recover quickly from regional failures and minimizes impact to users.

Choosing Between Azure Traffic Manager and Azure DNS

Both Azure Traffic Manager and Azure DNS can be used for global traffic routing, but they solve slightly different problems. The choice depends mainly on how fast you need to react to failures and how much control you want over traffic behavior.

| Capability | Azure Traffic Manager | Azure DNS |
| --- | --- | --- |
| Routing mechanism | DNS-based with built-in health probes | DNS-based only |
| Health checks | Native endpoint health probing | No native health checks |
| Failover speed (RTO) | Low RTO (typically seconds to < 1 minute) | Higher RTO (depends on DNS TTL, often minutes) |
| Traffic steering options | Priority, weighted, performance, geographic | Basic DNS records |
| Control during outages | Automatic endpoint removal | Relies on DNS cache expiration |
| Operational complexity | Slightly higher | Very low |
| Typical use cases | Mission-critical workloads | Simpler or cost-sensitive scenarios |

Data and State Management Across Regions

Kubernetes platforms are usually designed to be stateless, which makes scaling and recovery much easier. In practice, most enterprise applications still depend on stateful services such as databases, caches, and file storage. When running across multiple regions, handling this state correctly becomes one of the hardest parts of the architecture.

The general approach is to keep application components stateless inside the AKS clusters and rely on Azure managed services for data persistence and replication. These services handle most of the complexity involved in synchronizing data across regions and provide well-defined recovery behaviors during failures.

Common patterns include using Azure SQL Database with active geo-replication or failover groups for relational workloads. This allows a secondary region to take over when the primary region becomes unavailable, with controlled failover and predictable recovery behavior. For globally distributed applications, Azure Cosmos DB provides built-in multi-region replication with configurable consistency levels. This makes it easier to support active/active scenarios, but it also requires careful thought around how the application handles concurrent writes and potential conflicts. Caching layers such as Azure Cache for Redis can be geo-replicated to reduce latency and improve availability. These caches should be treated as disposable and rebuilt when needed, rather than relied on as a source of truth.
For object and file storage, Azure Blob Storage and Azure Files support geo-redundant options such as GRS and RA-GRS. These options provide data durability across regions and allow read access from secondary regions, which is often sufficient for backup, content distribution, and disaster recovery scenarios.

When designing data replication across regions, architects should be clear about trade-offs. Strong consistency across regions usually increases latency and limits scalability, while eventual consistency improves availability but may expose temporary data mismatches. Replication lag, failover behavior, and conflict resolution should be understood and tested before going to production.

Security and Governance Considerations

In a multi-region setup, security and governance should look the same in every region. The goal is to avoid special cases and reduce the risk of configuration drift as the platform grows. Consistency is more important than introducing region-specific controls.

Identity and access management is typically centralized using Microsoft Entra ID. Access to AKS clusters is controlled through a combination of Azure RBAC and Kubernetes RBAC, allowing teams to manage permissions in a way that aligns with existing Azure roles while still supporting Kubernetes-native access patterns.

Network security is enforced through segmentation. A hub-and-spoke topology is commonly used, with shared services such as firewalls, DNS, and connectivity hosted in a central hub and application workloads deployed in regional spokes. This approach helps control traffic flows, limits blast radius, and simplifies auditing.

Policy and threat protection are applied at the platform level. Azure Policy for Kubernetes is used to enforce baseline configurations, such as allowed images, pod security settings, and resource limits. Microsoft Defender for Containers provides visibility into runtime threats and misconfigurations across all clusters.

Landing zones play a key role in this design. By integrating AKS clusters into a standardized landing zone setup, governance controls such as policies, role assignments, logging, and network rules are applied consistently across subscriptions and regions. This makes the platform easier to operate and reduces the risk of gaps as new regions are added.

AKS Observability and Resilience Testing

Running AKS across multiple regions only works if you can clearly see what is happening across the entire platform. Observability should be centralized so operators don’t need to switch between regions or tools when troubleshooting issues. Azure Monitor and Log Analytics are typically used as the main aggregation point for logs and metrics from all clusters. This makes it easier to correlate signals across regions and quickly understand whether an issue is local to one cluster or affecting the platform as a whole.

Distributed tracing adds another important layer of visibility. By using OpenTelemetry, requests can be traced end to end as they move through services and across regions. This is especially useful in active/active setups, where traffic may shift between regions based on health or latency.

Synthetic probes and health checks should be treated as first-class signals. These checks continuously test application endpoints from outside the platform and help validate that routing, failover, and recovery mechanisms behave as expected (a minimal probe sketch is shown at the end of this section).

Observability alone is not enough. Resilience assumptions must be tested regularly.
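To make the synthetic-probe idea concrete, here is a minimal, illustrative Python sketch of an external health check; it is not part of the reference implementation, and the regional endpoint URLs, the /healthz path, and the 5-second timeout are placeholder assumptions. In practice such probes are usually run by a managed capability such as Azure Monitor availability tests or a scheduled job that ships results to Log Analytics.

```python
import time
import urllib.error
import urllib.request

# Hypothetical regional endpoints exposed through the regional ingress layer.
REGIONAL_ENDPOINTS = {
    "westeurope": "https://weu.example.com/healthz",
    "northeurope": "https://neu.example.com/healthz",
}

def probe(url: str, timeout: float = 5.0) -> dict:
    """Call a health endpoint and record reachability, status code, and latency."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status = resp.status
    except (urllib.error.URLError, TimeoutError) as exc:
        return {"healthy": False, "error": str(exc), "latency_s": time.monotonic() - start}
    return {"healthy": 200 <= status < 300, "status": status, "latency_s": time.monotonic() - start}

if __name__ == "__main__":
    # One result per region; in practice these would be pushed to the central
    # monitoring workspace and alerted on when a region turns unhealthy.
    for region, url in REGIONAL_ENDPOINTS.items():
        print(region, probe(url))
```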
Chaos engineering and planned failover exercises help teams understand how the system behaves under failure conditions and whether operational runbooks are realistic. These tests should be performed in a controlled way and repeated over time, especially after platform changes. The goal is not to eliminate failures, but to make failures predictable, visible, and recoverable.

Conclusion and Next Steps

Building a highly available, multi-region AKS platform is mostly about making clear decisions and understanding their impact. Traffic routing, data replication, security, and operations all play a role, and there are always trade-offs between availability, complexity, and cost. The reference architecture described in this article provides a solid starting point for running AKS across regions on Azure. It focuses on proven patterns that work well in real environments and scale as requirements grow. The most important takeaway is that multi-region is not a single feature you turn on. It is a set of design choices that must work together and be tested regularly.

Deployment Models

| Area | Active/Active | Active/Passive | Deployment Stamps |
| --- | --- | --- | --- |
| Availability | Highest | High | Depends on routing model |
| Failover time | Very low | Medium | Depends on implementation |
| Operational complexity | High | Medium | Medium to high |
| Cost | Highest | Lower | Medium |
| Typical use case | Mission-critical workloads | Business-critical workloads | Large or regulated platforms |

Traffic Routing and Failover

| Aspect | Azure Front Door + Traffic Manager | Azure DNS |
| --- | --- | --- |
| Health-based routing | Yes | No |
| Failover speed (RTO) | Seconds to < 1 minute | Minutes (TTL-based) |
| Traffic steering | Advanced | Basic |
| Recommended for | Production and critical workloads | Simple or non-critical workloads |

Data and State Management

| Data Type | Recommended Approach | Notes |
| --- | --- | --- |
| Relational data | Azure SQL with geo-replication | Clear primary/secondary roles |
| Globally distributed data | Cosmos DB multi-region | Consistency must be chosen carefully |
| Caching | Azure Cache for Redis | Treat as disposable |
| Object and file storage | Blob / Files with GRS or RA-GRS | Good for DR and read scenarios |

Security and Governance

| Area | Recommendation |
| --- | --- |
| Identity | Centralize with Microsoft Entra ID |
| Access control | Combine Azure RBAC and Kubernetes RBAC |
| Network security | Hub-and-spoke topology |
| Policy enforcement | Azure Policy for Kubernetes |
| Threat protection | Defender for Containers |
| Governance | Use landing zones for consistency |

Observability and Testing

| Practice | Why It Matters |
| --- | --- |
| Centralized monitoring | Faster troubleshooting |
| Metrics, logs, traces | Full visibility across regions |
| Synthetic probes | Early failure detection |
| Failover testing | Validate assumptions |
| Chaos engineering | Build confidence in recovery |

Recommended Next Steps

If you want to move from design to implementation, the following steps usually work well:

- Start with a proof of concept using two regions and a simple workload
- Define RTO and RPO targets and validate them with tests
- Create operational runbooks for failover and recovery
- Automate deployments and configuration using CI/CD and GitOps
- Regularly test failover and recovery, not just once

For deeper guidance, the Azure Well-Architected Framework and the Azure Architecture Center provide additional patterns, checklists, and reference implementations that build on the concepts discussed here.

Company Portal Installation failing due to missing Microsoft.UI.Xaml.2.7
Dear All,

We are deploying the Company Portal app as a Microsoft Store app (new) from Intune on Hybrid Domain Joined devices. While some devices install Company Portal successfully, others are failing. I reviewed the events in the subfolders of the following locations:

- Event Viewer --> Applications and Services Logs --> Microsoft --> Windows --> Appx Deployment
- Event Viewer --> Applications and Services Logs --> Microsoft --> Windows --> Appx Deployment-Server
- Event Viewer --> Applications and Services Logs --> Microsoft --> Windows --> Appx Deployment-Server-Undocked
- Event Viewer --> Applications and Services Logs --> Microsoft --> Windows --> AppxPackagingOM

During the review I found error 0x80073cf3: Package failed updates, dependency or conflict validation. This is the reason the Company Portal installation fails: Microsoft.UI.Xaml.2.7 is not installed on the device. If I execute the commands below one after another in the command prompt, the Company Portal installation succeeds:

Winget Install --accept-source-agreements --accept-package-agreements Microsoft.UI.Xaml.2.7
Winget Install --accept-source-agreements --accept-package-agreements Microsoft.CompanyPortal

My question is: how can I add Microsoft.UI.Xaml.2.7 as a dependency app for the Company Portal app, especially when the app type is Microsoft Store app (new)? I do not want to deploy Company Portal as a Win32 app and also deploy Microsoft.UI.Xaml.2.7 as a Win32 app, because with that deployment method I would always have to create a new Win32 app whenever a new version is released.

Has anyone come across the same situation, and do you have any thoughts?

Azure Course Blueprints
Each Blueprint serves as a 1:1 visual representation of the official Microsoft instructor‑led course (ILT), ensuring full alignment with the learning path. This helps learners: see exactly how topics fit into the broader Azure landscape, map concepts interactively as they progress, and understand the “why” behind each module, not just the “what.”

Formats available: PDF · Visio · Excel · Video. Every icon is clickable and links directly to the related Learn module.

Layers and Cross‑Course Comparisons

For expert‑level certifications like SC‑100 and AZ‑305, the Visio Template+ includes additional layers for each associate-level course. This allows trainers and students to compare certification paths at a glance:

- 🔐 Security Path: SC‑100 side‑by‑side with SC‑200, SC‑300, AZ‑500
- 🏗️ Infrastructure & Dev Path: AZ‑305 alongside AZ‑104, AZ‑204, AZ‑700, AZ‑140

This helps learners clearly identify prerequisites, skill gaps, overlapping modules, and progression paths toward expert roles. Because associate certifications (e.g., SC‑300 → SC‑100 or AZ‑104 → AZ‑305) are often prerequisites or recommended foundations, this comparison layer makes it easy to understand what additional knowledge is required as learners advance.

Azure Course Blueprints + Demo Deploy

Demos are essential for achieving end‑to‑end understanding of Azure. To reduce preparation overhead, we collaborated with Peter De Tender to align each Blueprint with the official Trainer Demo Deploy scenarios. With a single click, trainers can deploy the full environment and guide learners through practical, aligned demonstrations. https://aka.ms/DemoDeployPDF

Benefits for Students

- 🎯 Defined Goals: Learners clearly see the skills and services they are expected to master.
- 🔍 Focused Learning: By spotlighting what truly matters, the Blueprint keeps learners oriented toward core learning objectives.
- 📈 Progress Tracking: Students can easily identify what they’ve already mastered and where more study is needed.
- 📊 Slide Deck Topic Lists (Excel): A downloadable .xlsx file provides a topic list for every module, links to Microsoft Learn, and prerequisite dependencies. This file helps students build their own study plan while keeping all links organized.
Download links

Associate Level

| Course | Dates (R / U) | Contents |
| --- | --- | --- |
| AZ-104 Azure Administrator Associate | R: 12/14/2023, U: 12/17/2025 | Blueprint, Demo, Video, Visio, Excel |
| AZ-204 Azure Developer Associate | R: 11/05/2024, U: 12/17/2025 | Blueprint, Demo, Visio, Excel |
| AZ-500 Azure Security Engineer Associate | R: 01/09/2024, U: 10/10/2024 | Blueprint, Demo, Visio+, Excel |
| AZ-700 Azure Network Engineer Associate | R: 01/25/2024, U: 12/17/2025 | Blueprint, Demo, Visio, Excel |
| SC-200 Security Operations Analyst Associate | R: 04/03/2025, U: 04/09/2025 | Blueprint, Demo, Visio, Excel |
| SC-300 Identity and Access Administrator Associate | R: 10/10/2024 | Blueprint, Demo, Excel |

Specialty

| Course | Dates (R / U) | Contents |
| --- | --- | --- |
| AZ-140 Azure Virtual Desktop Specialty | R: 01/03/2024, U: 12/17/2025 | Blueprint, Demo, Visio, Excel |

Expert Level

| Course | Dates (R / U) | Contents |
| --- | --- | --- |
| AZ-305 Designing Microsoft Azure Infrastructure Solutions | R: 05/07/2024, U: 12/17/2025 | Blueprint, Demo, Visio+ (AZ-104, AZ-204, AZ-700, AZ-140), Excel |
| SC-100 Microsoft Cybersecurity Architect | R: 10/10/2024, U: 04/09/2025 | Blueprint, Demo, Visio+ (AZ-500, SC-300, SC-200), Excel |

Skill-based Credentialing

| Course | Dates (R / U) | Contents |
| --- | --- | --- |
| AZ-1002 Configure secure access to your workloads using Azure virtual networking | R: 05/27/2024 | Blueprint, Visio, Excel |
| AZ-1003 Secure storage for Azure Files and Azure Blob Storage | R: 02/07/2024, U: 02/05/2024 | Blueprint, Excel |

Subscribe if you want to get notified of any update, like new releases or updates.

Author: Ilan Nyska, Microsoft Technical Trainer
Email: ilan.nyska@microsoft.com
LinkedIn: https://www.linkedin.com/in/ilan-nyska/

I’ve received so many kind messages, thank-you notes, and reshares, and I’m truly grateful. But here’s the reality: 💬 the only thing I can use internally to justify continuing this project is your engagement, through this survey: https://lnkd.in/gnZ8v4i8

Benefits for Trainers

Trainers can follow this plan to design a tailored diagram for their course, filled with notes. They can construct this comprehensive diagram during class on a whiteboard and continuously add to it in each session. This evolving visual aid can be shared with students to enhance their grasp of the subject matter.

Explore Azure Course Blueprints! | Microsoft Community Hub
Visio stencils: Azure icons - Azure Architecture Center | Microsoft Learn

Are you curious how grounding Copilot in Azure Course Blueprints transforms your study journey into a smarter, more visual experience?

- 🧭 Clickable guides that transform modules into intuitive roadmaps
- 🌐 Dynamic visual maps revealing how Azure services connect
- ⚖️ Side-by-side comparisons that clarify roles, services, and security models

Whether you're a trainer, a student, or just certification-curious, Copilot becomes your shortcut to clarity, confidence, and mastery. Navigating Azure Certifications with Copilot and Azure Course Blueprints | Microsoft Community Hub

Boosting Hybrid Cloud Data Efficiency for EDA: The Power of Azure NetApp Files cache volumes
Electronic Design Automation (EDA) is the foundation of modern semiconductor innovation, enabling engineers to design, simulate, and validate increasingly sophisticated chip architectures. As designs push the boundaries of PPA (Power, Performance, Area) to meet escalating market demands, the volume of associated design data has surged exponentially, with a single System-on-Chip (SoC) project generating multiple petabytes of data during its development lifecycle, making data mobility and accessibility critical bottlenecks. To overcome these challenges, Azure NetApp Files (ANF) cache volumes are purpose-built to optimize data movement and minimize latency, delivering high-speed access to massive design datasets across distributed environments. By mitigating data gravity, Azure NetApp Files cache volumes empower chip designers to leverage cloud-scale compute resources on demand and at scale, thus accelerating innovation without being constrained by physical infrastructure.

Deploy PostgreSQL on Azure VMs with Azure NetApp Files: Production-Ready Infrastructure as Code
PostgreSQL is a popular open‑source cloud database for modern web applications and AI/ML workloads, and deploying it on Azure VMs with high‑performance storage should be simple. In practice, however, using Azure NetApp Files requires many coordinated steps—from provisioning networking and storage to configuring NFS, installing and initializing PostgreSQL, and maintaining consistent, secure, and high‑performance environments across development, test, and production. To address this complexity, we’ve built production‑ready Infrastructure as Code templates that fully automate the deployment, from infrastructure setup to database initialization, ensuring PostgreSQL runs on high‑performance Azure NetApp Files storage from day one.

What's New with Azure NetApp Files VS Code Extension
The latest update to the Azure NetApp Files (ANF) VS Code Extension introduces powerful enhancements designed to simplify cloud storage management for developers. From multi-tenant support to intuitive right-click mounting and AI-powered commands, this release focuses on improving productivity and streamlining workflows within Visual Studio Code. Explore the new features, learn how they accelerate development, and see why this extension is becoming an essential tool for cloud-native applications.

From Large Semi-Structured Docs to Actionable Data: In-Depth Evaluation Approaches Guidance
Introduction

Extracting structured data from large, semi-structured documents (the detailed solution implementation overview and architecture is provided in this Tech Community blog: From Large Semi-Structured Docs to Actionable Data: Reusable Pipelines with ADI, AI Search & OpenAI) demands a rigorous evaluation framework. The goal is to ensure our pipeline is accurate, reliable, and scalable before we trust it with mission-critical data. This framework breaks evaluation into clear phases, from how we prepare the document, to how we find relevant parts, to how we validate the final output. It provides metrics, examples, and best practices at each step, forming a generic pattern that can be applied to various domains.

Framework Overview

A structured, stepwise approach for evaluation is given below:

- Establish Ground Truth & Sampling: Define a robust ground truth set and sampling method to fairly evaluate all parts of the document.
- Preprocessing Evaluation: Verify that OCR, chunking, and any structural augmentation (like adding headers) preserve all content and context.
- Labelling Evaluation: Check classification of sections/chunks by content based on topic/entity and ensure irrelevant data is filtered out without losing any important context.
- Retrieval Evaluation: Ensure the system can retrieve the right pieces of information (using search) with high precision@k and recall@k.
- Extraction Accuracy Evaluation: Measure how well the final structured data matches the expected values (field accuracy, record accuracy, overall precision/recall).
- Continuous Improvement Loop with SME: Use findings to retrain, tweak, and improve, enabling the framework to be reused for new documents and iterations. SMEs play a huge role in such scenarios.

Detailed Guidance on Evaluation

Below is a step-by-step, in-depth guide to evaluating this kind of IDP (Intelligent Document Processing) pipeline, covering both the overall system and its individual components.

Establish Ground Truth & Sampling

Why: Any evaluation is only as good as the ground truth it’s compared against. Start by assembling a reliable “source of truth” dataset for your documents. This often means manual labelling of some documents by domain experts (e.g., a legal team annotating key clauses in a contract, or accountants verifying invoice fields). Because manual curation is expensive, be strategic in what and how we sample.

Ground Truth Preparation: Identify the critical fields and sections we need to extract, and create an annotated set of documents with those values marked correct. For example, if processing financial statements, we might mark the ground truth values for Total Assets, Net Income, Key Ratios, etc. This ground truth should be the baseline to measure accuracy against. Although creating it is labour-intensive, it yields a precise benchmark for model performance.

Stratified Sampling: Documents like contracts or policies have diverse sections. To evaluate thoroughly, use stratified sampling: ensure your test set covers all major content types and difficulty levels. For instance, if 15% of pages in a set of contracts are annexes or addendums, then ~15% of your evaluation pages should come from annexes, not just the main body. This prevents the evaluation from overlooking challenging or rare sections. In practice, we might partition a document by section type (clauses, tables, schedules, footnotes) and sample a proportion from each. This way, metrics reflect performance on each type of content, not just the easiest portions.
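To illustrate the stratified sampling idea, here is a minimal Python sketch (not from the original framework); the section-type names and sample size are hypothetical, and proportional allocation with a minimum of one page per stratum is just one reasonable policy.

```python
import math
import random
from collections import defaultdict

def stratified_sample(pages, sample_size, seed=42):
    """Sample evaluation pages proportionally to each section type's share.

    pages: list of (page_id, section_type) tuples, e.g. ("doc1-p12", "annex").
    Returns a list of sampled page ids that covers every section type.
    """
    random.seed(seed)
    by_type = defaultdict(list)
    for page_id, section_type in pages:
        by_type[section_type].append(page_id)

    total = len(pages)
    sampled = []
    for section_type, ids in by_type.items():
        # Proportional allocation, but never fewer than one page per stratum.
        quota = max(1, math.ceil(sample_size * len(ids) / total))
        sampled.extend(random.sample(ids, min(quota, len(ids))))
    return sampled

# Hypothetical example: roughly 15% of pages are annexes, so ~15% of the
# sampled evaluation pages should be annexes as well.
pages = [(f"p{i}", "annex" if i % 7 == 0 else "main_body") for i in range(200)]
print(len(stratified_sample(pages, sample_size=40)))
```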
Multi-Voter Agreement (Consensus): It’s often helpful to have multiple automated voters on the outputs before involving humans. For example, suppose we extracted an invoice amount; we can have:

- A regex/format checker or fuzzy-matching voter
- A cross-field logic checker or embedding-based matching voter
- An ML model confidence score or LLM-as-a-judge vote

If all signals are strong, we label that extraction as Low Risk; if they conflict, we mark it High Risk for human review. By tallying such “votes”, we create tiers of confidence. Why? Because in many cases, a large portion of outputs will be obviously correct (e.g., over 80% might have unanimous high confidence), and we can safely assume those are right, focusing manual review on the remainder. This strategy effectively reduces the human workload while maintaining quality.

Preprocessing Evaluation

Before extracting meaning, make sure the raw text and structure are captured correctly. Any loss here breaks the whole pipeline. Key evaluation checks:

OCR / Text Extraction Accuracy
- Character/Word Error Rate: Sample pages to see how many words are recognized correctly (use per-word confidence to spot issues).
- Layout Preservation: Ensure reading order isn’t scrambled, especially in multi-column pages or footnotes.
- Content Coverage: Verify every sentence from a sample page appears in the extracted text. Missing footers or sidebars count as gaps.

Chunking
- Completeness: Combined chunks should reconstruct the full document. Word counts should match.
- Segment Integrity: Chunks should align to natural boundaries (paragraphs, tables). Track a metric like “95% clean boundaries.”
- Context Preservation: If a table or section spans chunks, mark relationships so downstream logic sees them as connected.

Multi-page Table Handling
- Header Insertion Accuracy: Validate that continued pages get the correct header (aim for the high 90s to maintain context across documents).
- No False Headers: Ensure new tables aren’t mistakenly treated as continuations. Track a False Continuation Rate and push it to near zero.
- Practical Check: Sample multi-page tables across docs to confirm consistent extraction and no missed rows.

Structural Links / References
- Link Accuracy: Confirm references (like footnotes or section anchors) map to the right targets (e.g., 98%+ correct).
- Ontology / Category Coverage: If content is pre-grouped, check precision (no mis-grouping) and recall (nothing left uncategorized).

Implication: The goal is to ensure the pre-processed chunks are a faithful, complete, and structurally coherent representation of the original document. Metrics like content coverage, boundary cleanliness, and header accuracy help catch issues early. Fixing them here saves significant downstream debugging.

Labelling Evaluation – “Did we isolate the right pieces?”

Once we chunk the document, we label those chunks (with ML or rules) to map them to the right entities and throw out the noise. Think of this step as sorting useful clauses from filler.

Section/Entity Labelling Accuracy: Treat labelling as a multi-class or multi-label classification problem.

- Precision (Label Accuracy): Of the chunks we labelled as X, how many were actually X? Example: the model tags 40 chunks as “Financial Data.” If 5 are wrong, precision is 87.5%. High precision avoids polluting a category (topic/entity) with junk.
- Recall (Coverage): Of the chunks that truly belong to category X, how many did we catch? Example: ground truth has 50 Financial Data chunks, and the model finds 45. Recall is 90%. High recall prevents missing important sections.

A minimal sketch for computing these per-label metrics follows this list.
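The following is a small illustrative Python sketch of per-label precision and recall for multi-label chunk classification; the chunk IDs and label names are hypothetical, and a real pipeline would typically compute this from exported prediction and ground-truth tables.

```python
from collections import defaultdict

def per_label_precision_recall(ground_truth, predictions):
    """Compute per-label precision and recall for multi-label chunk classification.

    ground_truth / predictions: dict mapping chunk_id -> set of labels.
    Returns: dict mapping label -> {"precision": float, "recall": float}.
    """
    tp = defaultdict(int)  # label predicted and present in ground truth
    fp = defaultdict(int)  # label predicted but not in ground truth
    fn = defaultdict(int)  # label in ground truth but missed

    for chunk_id, true_labels in ground_truth.items():
        pred_labels = predictions.get(chunk_id, set())
        for label in pred_labels & true_labels:
            tp[label] += 1
        for label in pred_labels - true_labels:
            fp[label] += 1
        for label in true_labels - pred_labels:
            fn[label] += 1

    metrics = {}
    for label in set(tp) | set(fp) | set(fn):
        precision = tp[label] / (tp[label] + fp[label]) if (tp[label] + fp[label]) else 0.0
        recall = tp[label] / (tp[label] + fn[label]) if (tp[label] + fn[label]) else 0.0
        metrics[label] = {"precision": precision, "recall": recall}
    return metrics

# Hypothetical example: two chunks with ground-truth and predicted label sets.
gt = {"chunk-1": {"Financial Data"}, "chunk-2": {"Key Clauses"}}
pred = {"chunk-1": {"Financial Data"}, "chunk-2": {"Financial Data"}}
print(per_label_precision_recall(gt, pred))
```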
Example: A model labels paper sections as Introduction, Methods, Results, etc. It marks 100 sections as Results and 95 are correct (95% precision). It misses 5 actual Results sections (slightly lower recall). That’s acceptable if downstream steps can still recover some items. But low precision means the labelling logic needs tightening.

Implication: Low precision means wrong info contaminates the category. Low recall means missing crucial bits. Use these metrics to refine definitions or adjust the labelling logic. Don’t just report one accuracy number; precision and recall per label tell the real story.

Retrieval Evaluation – “Can we find the right info when we ask?”

Many document pipelines use retrieval to narrow a huge file down to the few chunks most likely to contain the answer corresponding to a topic/entity. If we need a “termination date,” we first fetch chunks about dates or termination, then extract from those. Retrieval must be sharp, or everything downstream suffers.

Precision@K: How many of the top K retrieved chunks are actually relevant? If we grab 5 chunks for “Key Clauses” and 4 are correct, Precision@5 is 80%. We usually set K to whatever the next stage consumes (3 or 5). High precision keeps extraction clean. Average it across queries or fields. Critical fields may demand very high Precision@K.

Recall@K: Did we retrieve enough of the relevant chunks? If there are 2 relevant chunks in the doc but the top 5 results include only 1, recall is 50%. Good recall means we aren’t missing mentions in other sections or appendices. Increasing K improves recall but can dilute precision. Tune both together.

Ranking Quality (MRR, NDCG): If order matters, use rank-aware metrics.
- MRR: Measures how early the first relevant result appears. Perfect if it’s always at rank 1.
- NDCG@K: Rewards having the most relevant chunks at the top. Useful when relevance isn’t binary.
Most pipelines can get away with Precision@K and maybe MRR.

Implication: Test 50 QA pairs from policy documents, retrieving 3 passages per query. Average Precision@3: 85%. Average Recall@3: 92%. MRR: 0.8. Suppose we notice “data retention” answers appear in appendices that sometimes rank low. We increase K to 5 for that query type. Precision@3 rises to 90%, and Recall@5 hits roughly 99%. Retrieval evaluation is a sanity check. If retrieval fails, extraction recall will tank no matter how good the extractor is. Measure both so we know where the leak is. Also keep an eye on latency and cost if fancy re-rankers slow things down.

Extraction Accuracy Evaluation – “Did we get the right answers?”

Look at each field and measure how often we got the right value.

- Precision: Of the values we extracted, what percent are correct? Use exact match or a lenient version if small format shifts don’t matter. Report both when useful.
- Recall: Out of all ground truth values, how many did we actually extract?
- Per-field breakdown: Some fields will be easy (invoice numbers, dates), others messy (vendor names, free text). A simple table makes this obvious and shows where to focus improvements.

Error Analysis: Numbers don’t tell the whole story. Look at patterns:
- OCR mix-ups
- Bad date or amount formats
- Wrong chunk retrieved upstream
- Misread tables
Find the recurring mistakes. That’s where the fixes live.

Holistic Metrics: If needed, compute overall precision/recall across all extracted fields. But per-field and record-level metrics are usually what matter to stakeholders.

Implication: Precision protects against wrong entries. Recall protects against missing data. A small sketch of exact versus lenient (fuzzy) field matching follows.
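To illustrate the exact versus lenient matching distinction, here is a minimal Python sketch using the standard library’s difflib; the normalization rules, the 0.9 similarity threshold, and the example field values are assumptions, and a production setup might prefer a dedicated fuzzy-matching library.

```python
from difflib import SequenceMatcher

def normalize(value: str) -> str:
    """Light normalization so trivial format differences don't count as errors."""
    return " ".join(value.strip().lower().split())

def field_match(extracted: str, expected: str, fuzzy_threshold: float = 0.9):
    """Return (exact_match, fuzzy_match) for a single extracted field value."""
    a, b = normalize(extracted), normalize(expected)
    exact = a == b
    fuzzy = exact or SequenceMatcher(None, a, b).ratio() >= fuzzy_threshold
    return exact, fuzzy

def field_precision(pairs, fuzzy=False):
    """pairs: list of (extracted, expected) values for one field across documents."""
    if not pairs:
        return 0.0
    hits = sum(field_match(e, g)[1 if fuzzy else 0] for e, g in pairs)
    return hits / len(pairs)

# Hypothetical example: vendor names often differ only by punctuation or case.
pairs = [("Contoso Ltd.", "Contoso Ltd"), ("Fabrikam Inc", "Fabrikam Incorporated")]
print(field_precision(pairs, fuzzy=False), field_precision(pairs, fuzzy=True))
```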
Choose your balance based on risk:
- If false positives hurt more (wrong financial numbers), favour precision.
- If missing items hurt more (missing red-flag clauses), favour recall.

Continuous Improvement Loop with SME

Continuous improvement means treating evaluation as an ongoing feedback loop rather than a one-time check. Each phase’s errors point to concrete fixes, and every fix is re-measured to ensure accuracy moves in the right direction without breaking other components. The same framework also supports A/B testing alternative methods and monitoring real production data to detect drift or new document patterns. Because the evaluation stages are modular, they generalize well across domains such as contracts, financial documents, healthcare forms, or academic papers with only domain-specific tweaks. Over time, this creates a stable, scalable, and measurable path toward higher accuracy, better robustness, and easier adaptation to new document types.

Conclusion

Building an end-to-end evaluation framework isn’t just about measuring accuracy; it’s about creating trust in the entire pipeline. By breaking the process into clear phases, defining robust ground truth, and applying precision/recall-driven metrics at every stage, we ensure that document processing systems are reliable, scalable, and adaptable. This structured approach not only highlights where improvements are needed but also enables continuous refinement through SME feedback and iterative testing. Ultimately, such a framework transforms evaluation from a one-time exercise into a sustainable practice, paving the way for higher-quality outputs across diverse domains.

From Large Semi-Structured Docs to Actionable Data: Reusable Pipelines with ADI, AI Search & OpenAI
Problem Space

Large semi-structured documents such as contracts, invoices, hospital tariff/rate cards, multi-page reports, and compliance records often carry essential information that is difficult to extract reliably with traditional approaches. Their layout can span several pages, the structure is rarely consistent, and related fields may appear far apart even though they must be interpreted together. This makes it hard not only to detect the right pieces of information but also to understand how those pieces relate across the document.

LLMs can help, but when documents are long and contain complex cross-references, they may still miss subtle dependencies or generate hallucinated information. That becomes risky in environments where small errors can cascade into incorrect decisions. At the same time, these documents don’t change frequently, while the extracted data is used repeatedly by multiple downstream systems at scale. Because of this usage pattern, a RAG-style pipeline is often not ideal in terms of cost, latency, or consistency. Instead, organizations need a way to extract data once, represent it consistently, and serve it efficiently in a structured form to a wide range of applications, many of which are not conversational AI systems.

At this point, data stewardship becomes critical, because once information is extracted, it must remain accurate, governed, traceable, and consistent throughout its lifecycle. When the extracted information feeds compliance checks, financial workflows, risk models, or end-user experiences, the organization must ensure that the data is not just captured correctly but also maintained with proper oversight as it moves across systems. Any extraction pipeline that cannot guarantee quality, reliability, and provenance introduces long-term operational risk.

The core problem, therefore, is finding a method that handles the structural and relational complexity of large semi-structured documents, minimizes LLM hallucination risk, produces deterministic results, and supports ongoing data stewardship so that the resulting structured output stays trustworthy and usable across the enterprise.

Target Use Cases

The potential applications for an Intelligent Document Processing (IDP) pipeline differ across industries. Several industry-specific use cases are provided as examples to guide the audience in conceptualising and implementing solutions tailored to their unique requirements.

Hospital Tariff Digitization for Tariff-Invoice Reconciliation in Health Insurance
- Document types: Hospital tariff/rate cards, annexures/guidelines, pre-authorization guidelines, etc.
- Technical challenge: Charges for the same service might appear under different sections or for different hospital room types across different versions of tariff/rate cards. Mix of tables and free text, abbreviations, and cross-page references.
- Downstream usage: Reimbursement orchestration, claims adjudication.

Commercial Loan Underwriting in Banking
- Document types: Balance sheets, cash-flow statements, auditor reports, collateral documents.
- Technical challenge: Ratios and covenants must be computed from fields located across pages. Contextual dependencies: “Net revenue excluding exceptional items” or footnotes that override values.
- Downstream usage: Loan decisioning models, covenant monitoring, credit scoring.

Procurement Contract Intelligence in Manufacturing
- Document types: Vendor agreements, SLAs, pricing annexures.
- Technical challenge: Pricing rules defined across clauses that reference each other.
  Penalty and escalation conditions hidden inside nested sections.
- Downstream usage: Automated PO creation, compliance checks.

Regulatory Compliance Extraction
- Document types: GDPR/HIPAA compliance docs, audit reports.
- Technical challenge: Requirements and exceptions buried across many sections. Extraction must be deterministic since compliance logic is strict.
- Downstream usage: Rule engines, audit workflows, compliance checklists.

Solution Approaches

Problem Statement

Across industries, from finance and healthcare to legal and compliance, large semi-structured documents serve as the single source of truth for critical workflows. These documents often span hundreds of pages, mixing free text, tables, and nested references. Before any automation can validate transactions, enforce compliance, or perform analytics, this information must be transformed into a structured, machine-readable format. The challenge isn’t just size; it’s complexity. Rules and exceptions are scattered, relationships span multiple sections, and formatting inconsistencies make naive parsing unreliable. Errors at this stage ripple downstream, impacting reconciliation, risk models, and decision-making. In short, the fidelity of this digitization step determines the integrity of every subsequent process. Solving this problem requires a pipeline that can handle structural diversity, preserve context, and deliver deterministic outputs at scale.

Challenges

Many challenges can arise when solving for such large, complex documents:

- The documents can have ~200-250 pages.
- The document structures and layouts can be extremely complex in nature. A document or a page may contain a mix of various layouts like tables, text blocks, figures, etc.
- Sometimes a single table can stretch across multiple pages, but only the first page contains the table header, leaving the remaining pages without column labels.
- A topic on one page may be referenced from a different page, so there can be complex inter-relationships amongst different topics in the same document which need to be structured in a machine-readable format.
- The document can be semi-structured as well (some parts are structured; some parts are unstructured or free text).
- The downstream applications might not always be AI-assisted (they can be core analytics dashboards or existing enterprise legacy systems), so the structural storage of the digitized items from the documents needs to be very well thought out before moving ahead with the solution.

Motivation Behind the High-Level Approach

- A larger document (number of pages ~200) needs to be divided into smaller chunks so that it becomes readable and digestible (within context length) for the LLM (a minimal chunking sketch follows this list).
- To make the content/input of the LLM truly context-aware, references must be maintained across pages (for example, table headers of long, continuous tables need to be injected into those chunks that contain the tables without headers).
- If a pre-defined set of topics/entities is covered in the documents in consideration, then topic/entity-wise information needs to be extracted to make the system truly context-aware.
- Different chunks can cover a similar topic/entity, which becomes a search problem.
- Retrieval needs to happen for every topic/entity so that all information related to one topic/entity is in a single place; as a result, the downstream applications become efficient, scalable, and reliable over time.
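As a concrete illustration of the chunking idea referenced above, here is a minimal Python sketch that renders each page of a large PDF as an image chunk using pdf2image and PIL (the libraries referenced in the component list that follows); the file paths, DPI value, and naming scheme are assumptions for illustration, and pdf2image requires the Poppler utilities to be installed on the host.

```python
from pathlib import Path

from pdf2image import convert_from_path  # requires Poppler to be installed

def chunk_pdf_to_page_images(pdf_path: str, out_dir: str, dpi: int = 200) -> list[str]:
    """Render each page of a large PDF as a PNG chunk for downstream OCR and layout analysis.

    Returns the generated image paths ordered by page number, so that
    cross-page references (for example, continued tables) can still be tracked.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    pages = convert_from_path(pdf_path, dpi=dpi)  # list of PIL.Image objects
    chunk_paths = []
    for page_number, image in enumerate(pages, start=1):
        chunk_path = out / f"page_{page_number:04d}.png"
        image.save(chunk_path)  # PIL writes the PNG to disk
        chunk_paths.append(str(chunk_path))
    return chunk_paths

# Hypothetical usage for a ~200-page tariff document:
# chunks = chunk_pdf_to_page_images("tariff_card.pdf", "chunks/tariff_card")
```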
Sample Architecture and Implementation

Let’s take a possible approach to demonstrate the feasibility of the following architecture, building on the motivation outlined above. The solution divides a large, semi-structured document into manageable chunks, making it easier to maintain context and references across pages. First, the document is split into logical sections. Then, OCR and layout extraction capture both text and structure, followed by structure analysis to preserve semantic relationships. Annotated chunks are labeled and grouped by entity, enabling precise extraction of items such as key-value pairs or table data. As a result, the system efficiently transforms complex documents into structured, context-rich outputs ready for downstream analytics and automation.

Architecture Components

The key elements of the architecture diagram include components 1-6, which are code modules. Components 7 and 8 represent databases that store data chunks and extracted items, while component 9 refers to potential downstream systems that will use the structured data obtained from extraction.

1. Chunking: Break documents into smaller, logical sections such as pages or content blocks. Enables parallel processing and improves context handling for large files. Technology: Python-based chunking logic using pdf2image and PIL for image handling.
2. OCR & Layout Extraction: Convert scanned images into machine-readable text while capturing layout details like bounding boxes, tables, and reading order for structural integrity. Technology: Azure Document Intelligence or Microsoft Foundry Content Understanding Prebuilt Layout model combining OCR with deep learning for text, tables, and structure extraction.
3. Context-Aware Structural Analysis: Analyse the extracted layout to identify document components such as headers, paragraphs, and tables. Preserves semantic relationships for accurate interpretation. Technology: Custom Python logic leveraging OCR output to inject missing headers and summarize layout (row/column counts, sections per page).
4. Labelling: Assign entity-based labels to chunks according to a predefined schema or SME input. Helps filter irrelevant content and focus on meaningful sections. Technology: Azure OpenAI GPT-4.1-mini with NLI-style prompts for multi-class classification.
5. Entity-Wise Grouping: Organize chunks by entity type (e.g., invoice number, total amount) for targeted extraction. Reduces noise and improves precision in downstream tasks. Technology: Azure AI Search with Hybrid Search and Semantic Reranking for grouping relevant chunks.
6. Item Extraction: Extract specific values such as key-value pairs, line items, or table data from grouped chunks. Converts semi-structured content into structured fields. Technology: Azure OpenAI GPT-4.1-mini with Set-of-Marking-style prompts using layout clues (row × column, headers, OCR text).
7. Interim Chunk Storage: Store chunk-level data including OCR text, layout metadata, labels, and embeddings. Supports traceability, semantic search, and audit requirements. Technology: Azure AI Search for chunk indexing and Azure OpenAI embedding models for semantic retrieval.
8. Document Store: Maintain final extracted items with metadata and bounding boxes. Enables quick retrieval, validation, and integration with enterprise systems. Technology: Azure Cosmos DB, Azure SQL DB, Azure AI Search, or Microsoft Fabric depending on downstream needs (analytics, APIs, LLM apps).
9. Downstream Integration: Deliver structured outputs (JSON, CSV, or database records) to business applications or APIs. Facilitates automation and analytics across workflows. Technology: REST APIs, Azure Functions, or data pipelines integrated with enterprise systems.

Algorithms

Consider these key algorithms when implementing the components above:

- Structural Analysis – Inject headers: Detect tables page by page; compare the last row of a table on page i with the first row of a table on page i+1. If column counts match and at least 4 of 5 style features (font weight, background colour, font style, foreground colour, similar font family) match, mark it as a continuous table (header missing) and inject the previous page’s header into the next page’s table, repeating across pages.
- Labelling – Prompting Guide: Run NLI checks per SOC chunk image (grounded on OCR text) across N curated entity labels, return {decision ∈ {ENTAILED, CONTRADICTED, NEUTRAL}, confidence ∈ [0,1]}, and output only labels where decision = ENTAILED and confidence > 0.7.
- Entity-Wise Grouping – Querying Chunks per Entity & Top-50 Handling: Construct the query from the entity text and apply hybrid search with label filters in Azure AI Search, starting with chunks where the target label is sole, then expanding to observed co-occurrence combinations under a cap to prevent explosion. If label frequency > 50, run staged queries (sole-label, then capped co-label combos); otherwise use a single hybrid search with semantic reranking. Merge results and deduplicate before scoring.
- Entity-Wise Grouping – Chunk-to-Entity relevance scoring: For each retrieved chunk, split the text into spans; compute cosine similarities to the entity text and take the mean s. Boost with a gated nonlinearity b = σ(k(s - m)) · s, where σ is the sigmoid function and k, m are tunables that emphasize mid-range relevance while suppressing very low s. Min-max normalize the re-ranker score r to r_norm; compute the final score F = α·b + (1 - α)·r_norm, and keep the chunk iff F ≥ τ. A minimal sketch of this scoring step is shown after this list.
- Item Extraction – Prompting Guide: Provide the chunk image as input and ground on visual structure (tables, headers, gridlines, alignment, typography) and document structural metadata to segment and align units; reconcile ambiguities via OCR-extracted text, then enumerate associations by positional mapping (header ↔ column, row ↔ cell proximity) and emit normalized objects while filtering narrative/policy text by layout and pattern cues.
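To make the gated relevance-scoring formula concrete, here is a small illustrative Python sketch; it is not the production implementation, and the embedding callable, the reranker score range, and the k, m, alpha, and tau defaults are placeholders to be tuned per workload. A caller would supply its own embedding function and the raw score returned by the semantic reranker.

```python
import math
from typing import Callable, List

def cosine(u: List[float], v: List[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def chunk_relevance(
    chunk_spans: List[str],
    entity_text: str,
    embed: Callable[[str], List[float]],   # placeholder for any embedding model
    reranker_score: float,                 # raw score from the semantic reranker
    rr_min: float, rr_max: float,          # observed min/max reranker scores for normalization
    k: float = 10.0, m: float = 0.5,       # gate steepness and midpoint (tunable)
    alpha: float = 0.6, tau: float = 0.45, # blend weight and acceptance threshold (tunable)
):
    """Score one retrieved chunk against an entity and decide whether to keep it."""
    entity_vec = embed(entity_text)
    sims = [cosine(embed(span), entity_vec) for span in chunk_spans]
    s = sum(sims) / len(sims) if sims else 0.0    # mean span similarity

    gate = 1.0 / (1.0 + math.exp(-k * (s - m)))   # sigmoid gate sigma(k(s - m))
    b = gate * s                                  # boosted similarity

    score_range = (rr_max - rr_min) or 1.0
    r_norm = (reranker_score - rr_min) / score_range  # min-max normalized reranker score

    f = alpha * b + (1 - alpha) * r_norm          # final blended score F
    return f, f >= tau                            # keep the chunk iff F >= tau
```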
Deployment at Scale

There are several ways to implement a document extraction pipeline, each with its own pros and cons. The best deployment model depends on scenario requirements. Below are some common approaches with their advantages and disadvantages.

Host as a REST API
- Pros: Enables straightforward creation, customization, and deployment across scalable compute services such as Azure Kubernetes Service.
- Cons: Processing time and memory usage scale with document size and complexity, potentially requiring multiple iterations to optimize performance.

Deploy as an Azure Machine Learning (ML) Pipeline
- Pros: Facilitates efficient time and memory management, as Azure ML supports processing large datasets at scale.
- Cons: The pipeline may be more challenging to develop, customize, and maintain.

Deploy as an Azure Databricks Job
- Pros: Offers robust time and memory management similar to Azure ML, with advanced features such as Data Autoloader for detecting data changes and triggering pipeline execution.
- Cons: The solution is highly tailored to Azure Databricks and may have limited customization options.

Deploy as a Microsoft Fabric Pipeline
- Pros: Provides capabilities comparable to Azure ML and Databricks, and features like Fabric Activator replicate Databricks Autoloader functionality.
- Cons: Presents similar limitations to those found in the Azure ML and Azure Databricks approaches.

Each method should be carefully evaluated to ensure alignment with technical and operational requirements.

Evaluation

Objective: The aim is to evaluate how accurately a document extraction pipeline extracts information by comparing its output with manually verified data.

Approach: Documents are split into sections, labelled, and linked to relevant entities; then, AI tools extract key items through the pipeline outlined above. The extracted data is checked against expert-curated records using both exact and approximate matching techniques.

Key Metrics:
- Individual Item Attribute Match: Assesses the system’s ability to identify specific item attributes using strict and flexible comparison methods.
- Combined Item Attribute Match: Evaluates how well multiple attributes are identified together, considering both exact and fuzzy matches.
- Precision Calculation: Precision for each metric reflects the proportion of correctly matched entries compared to all reference entries.

Findings for a real-world scenario: Fuzzy matching of item key attributes yields high precision (over 90%), but accuracy drops for key attribute combinations (between 43% and 48%). These results come from analysis across several datasets to ensure reliability.

How This Addresses the Problem Statement

The sample architecture described integrates sectioning, entity linking, and attribute extraction as foundational steps. Each extracted item is then evaluated against expert-curated datasets using both strict (exact) and flexible (fuzzy) matching algorithms. This approach directly addresses the problem statement by providing measurable metrics, such as individual and combined attribute match rates and precision calculations, that quantify the system’s reliability and highlight areas for improvement. Ultimately, this methodology ensures that the pipeline’s output is systematically validated, and its strengths and limitations are clearly understood in real-world contexts.

Plausible Alternative Approaches

No single approach fits every use case; the best method depends on factors like document complexity, structure, sensitivity, and length as well as the downstream application types. Consider these alternative approaches for different scenarios.

Using Azure OpenAI alone
- Article: Best Practices for Structured Extraction from Documents Using Azure OpenAI

Using Azure OpenAI + Azure Document Intelligence + Azure AI Search (RAG-like solution)
- Article 1: Document Field Extraction with Generative AI
- Article 2: Complex Data Extraction using Document Intelligence and RAG
- Article 3: Design and develop a RAG solution

Using Azure OpenAI + Azure Document Intelligence + Azure AI Search (non-RAG-like solution)
- Article: Using Azure AI Document Intelligence and Azure OpenAI to extract structured data from documents
- GitHub Repository: Content processing solution accelerator

Conclusion

Intelligent Document Processing for large semi-structured documents isn’t just about extracting data; it’s about building trust in that data. By combining Azure Document Intelligence for layout-aware OCR with OpenAI models for contextual understanding, we create a well-thought-out, in-depth pipeline that is accurate, scalable, and resilient against complexity.
Chunking strategies ensure context fits within model limits, while header injection and structural analysis preserve relationships across pages to keep the pipeline context-aware. Entity-based grouping and semantic retrieval transform scattered content into organized, query-ready data. Finally, rigorous evaluation with a scalable ground truth strategy roadmap, using precision, recall, and fuzzy matching, closes the loop, ensuring reliability for downstream systems.

This pattern delivers more than automation; it establishes a foundation for compliance, analytics, and AI-driven workflows at enterprise scale. In short, it’s a blueprint for turning chaotic documents into structured intelligence: efficient, governed, and future-ready for any kind of downstream application.

End-to-End Evaluation Approaches Guidance

Given the complexity of this system, it should undergo a thorough end-to-end evaluation to ensure correctness, robustness, and performance across the pipeline. Continuous monitoring and observability of these metrics will enable iterative improvements and help the system scale reliably as requirements evolve. If you would like to read more about the end-to-end evaluation approaches guidance, please refer to our Tech Community blog: From Large Semi-Structured Docs to Actionable Data: In-Depth Evaluation Approaches Guidance.

References

- Azure Content Understanding in Foundry Tools
- Azure Document Intelligence in Foundry Tools
- Azure OpenAI in Microsoft Foundry models
- Azure AI Search
- Azure Machine Learning (ML) Pipelines
- Azure Databricks Job
- Microsoft Fabric Pipeline

Streamline Azure NetApp Files Management—Right from Your IDE
The Azure NetApp Files VS Code Extension is designed to streamline storage provisioning and management directly within the developer’s IDE. Traditional workflows often require extensive portal navigation, manual configuration, and policy management, leading to inefficiencies and context switching. The extension addresses these challenges by enabling AI-powered automation through natural language commands, reducing provisioning time from hours to minutes while minimizing errors and improving compliance. Key capabilities include generating production-ready ARM templates, validating resources, and delivering optimization insights—all without leaving the coding environment.