artificial intelligence

70 Topics

Unlocking Advanced Data Analytics & AI with Azure NetApp Files object REST API
Azure NetApp Files object REST API enables object access to enterprise file data stored on Azure NetApp Files, without copying, moving, or restructuring that data. This capability allows analytics and AI platforms that expect object storage to work directly against existing NFS based datasets, while preserving Azure NetApp Files’ performance, security, and governance characteristics.
GeertVanTeylingen
Feb 10, 2026 Place Azure Architecture Blog
424Views
0likes
0Comments
Building a Secure and Compliant Azure AI Landing Zone: Policy Framework & Best Practices
As organizations accelerate their AI adoption on Microsoft Azure, governance, compliance, and security become critical pillars for success. Deploying AI workloads without a structured compliance framework can expose enterprises to data privacy issues, misconfigurations, and regulatory risks. To address this challenge, the Azure AI Landing Zone provides a scalable and secure foundation — bringing together Azure Policy, Blueprints, and Infrastructure-as-Code (IaC) to ensure every resource aligns with organizational and regulatory standards. The Azure Policy & Compliance Framework acts as the governance backbone of this landing zone. It enforces consistency across environments by applying policy definitions, initiatives, and assignments that monitor and remediate non-compliant resources automatically. This blog will guide you through: 🧭 The architecture and layers of an AI Landing Zone 🧩 How Azure Policy as Code enables automated governance ⚙️ Steps to implement and deploy policies using IaC pipelines 📈 Visualizing compliance flows for AI-specific resources What is Azure AI Landing Zone (AI ALZ)? AI ALZ is a foundational architecture that integrates core Azure services (ML, OpenAI, Cognitive Services) with best practices in identity, networking, governance, and operations. To ensure consistency, security, and responsibility, a robust policy framework is essential. Policy & Compliance in AI ALZ Azure Policy helps enforce standards across subscriptions and resource groups. You define policies (single rules), group them into initiatives (policy sets), and assign them with certain scopes & exemptions. Compliance reporting helps surface noncompliant resources for mitigation. In AI workloads, some unique considerations: Sensitive data (PII, models) Model accountability, logging, audit trails Cost & performance from heavy compute usage Preview features and frequent updates Scope This framework covers: Azure Machine Learning (AML) Azure API Management Azure AI Foundry Azure App Service Azure Cognitive Services Azure OpenAI Azure Storage Accounts Azure Databases (SQL, Cosmos DB, MySQL, PostgreSQL) Azure Key Vault Azure Kubernetes Service Core Policy Categories 1. Networking & Access Control Restrict resource deployment to approved regions (e.g., Europe only). Enforce private link and private endpoint usage for all critical resources. Disable public network access for workspaces, storage, search, and key vaults. 2. Identity & Authentication Require user-assigned managed identities for resource access. Disable local authentication; enforce Microsoft Entra ID (Azure AD) authentication. 3. Data Protection Enforce encryption at rest with customer-managed keys (CMK). Restrict public access to storage accounts and databases. 4. Monitoring & Logging Deploy diagnostic settings to Log Analytics for all key resources. Ensure activity/resource logs are enabled and retained for at least one year. 5. Resource-Specific Guardrails Apply built-in and custom policy initiatives for OpenAI, Kubernetes, App Services, Databases, etc. A detailed list of all policies is bundled and attached at the end of this blog. Be sure to check it out for a ready-to-use Excel file—perfect for customer workshops—which includes policy type (Standalone/Initiative), origin (Built-in/Custom), and more. Implementation: Policy-as-Code using EPAC To turn policies from Excel/JSON into operational governance, Enterprise Policy as Code (EPAC) is a powerful tool. EPAC transforms policy artifacts into a desired state repository and handles deployment, lifecycle, versioning, and CI/CD automation. What is EPAC & Why Use It? EPAC is a set of PowerShell scripts / modules to deploy policy definitions, initiatives, assignments, role assignments, exemptions. Enterprise Policy As Code (EPAC) It supports CI/CD integration (GitHub Actions, Azure DevOps) so policy changes can be treated like code. It handles ordering, dependency resolution, and enforcement of a “desired state” — any policy resources not in your repo may be pruned (depending on configuration). It integrates with Azure Landing Zones (including governance baseline) out of the box. References & Further Reading EPAC GitHub Repository Advanced Azure Policy management - Microsoft Learn [Advanced A...Framework] How to deploy Azure policies the DevOps way [How to dep...- Rabobank]
Madhur_Shukla
Feb 05, 2026 Place Azure Architecture Blog
1.8KViews
1like
2Comments
Architecting an Azure AI Hub-and-Spoke Landing Zone for Multi-Tenant Enterprises
A large enterprise customer adopting AI at scale typically needs three non‑negotiables in its AI foundation: End‑to‑end tenant isolation across network, identity, compute, and data Secure, governed traffic flow from users to AI services Transparent chargeback/showback for shared AI and platform services At the same time, the platform must enable rapid onboarding of new tenants or applications and scale cleanly from proof‑of‑concept to production. This article proposes an Azure Landing Zone–aligned architecture using a Hub‑and‑Spoke model, where: The AI Hub centralizes shared services and governance AI Spokes host tenant‑dedicated AI resources Application logic and AI agents run on AKS The result is a secure, scalable, and operationally efficient enterprise AI foundation. 1. Architecture goals & design principles Goals Host application logic and AI agents on Azure Kubernetes Service (AKS) as custom deployments instead of using agents under Azure AI Foundry Enforce strong tenant isolation across all layers Support cross chargeback and cost attribution Adopt a Hub‑and‑Spoke model with clear separation of shared vs. tenant‑specific services Design principles (Azure Landing Zone aligned) Azure Landing Zone (ALZ) guidance emphasizes: Separation of platform and workload subscriptions Management groups and policy inheritance Centralized connectivity using hub‑and‑spoke networking Policy‑driven governance and automation For infrastructure as code, ALZ‑aligned deployments typically use Bicep or Terraform, increasingly leveraging Azure Verified Modules (AVM) for consistency and long‑term maintainability. 2. Subscription & management group model A practical enterprise layout looks like this: Tenant Root Management Group o Platform Management Group Connectivity subscription (Hub VNet, Firewall, DNS, ExpressRoute/VPN) Management subscription (Log Analytics, Monitor) Security subscription (Defender for Cloud, Sentinel if required) o AI Hub Management Group AI Hub subscription (shared AI and governance services) o AI Spokes Management Group One subscription per tenant, business unit, or regulated boundary This structure supports enterprise‑scale governance while allowing teams to operate independently within well‑defined guardrails. 3. Logical architecture — AI Hub vs. AI Spoke AI Hub (central/shared services) The AI Hub acts as the governed control plane for AI consumption: Ingress & edge security: Azure Application Gateway with WAF (or Front Door for global scenarios) Central egress control: Azure Firewall with forced tunneling API governance: Azure API Management (private/internal mode) Shared AI services: Azure OpenAI (shared deployments where appropriate), safety controls Monitoring & observability: Azure Monitor, Log Analytics, centralized dashboards Governance: Azure Policy, RBAC, naming and tagging standards All tenant traffic enters through the hub, ensuring consistent enforcement of security, identity, and usage policies. AI Spoke (tenant‑dedicated services) Each AI Spoke provides a tenant‑isolated data and execution plane: Tenant‑dedicated storage accounts and databases Vector stores and retrieval systems (Azure AI Search with isolated indexes or services) AKS runtime for tenant‑specific AI agents and backend services Tenant‑scoped keys, secrets, and identities 4. Logical architecture diagram (Hub vs. Spoke) 5. Network architecture — Hub and Spoke 6. Tenant onboarding & isolation strategy Tenant onboarding flow Tenant onboarding is automated using a landing zone vending model: Request new tenant or application Provision a spoke subscription and baseline policies Deploy spoke VNet and peer to hub Configure private DNS and firewall routes Deploy AKS tenancy and data services Register identities and API subscriptions Enable monitoring and cost attribution This approach enables consistent, repeatable onboarding with minimal manual effort. Isolation by design Network: Dedicated VNets, private endpoints, no public AI endpoints Identity: Microsoft Entra ID with tenant‑aware claims and conditional access Compute: AKS isolation using namespaces, node pools, or dedicated clusters Data: Per‑tenant storage, databases, and vector indexes 7. Identity & access management (Microsoft Entra ID) Key IAM practices include: Central Microsoft Entra ID tenant for authentication and authorization Application and workload identities using managed identities Tenant context enforced at API Management and propagated downstream Conditional Access and least‑privilege RBAC This ensures zero‑trust access while supporting both internal and partner scenarios. 8. Secure traffic flow (end‑to‑end) User accesses application via Application Gateway + WAF Traffic inspected and routed through Azure Firewall API Management validates identity, quotas, and tenant context AKS workloads invoke AI services over Private Link Responses return through the same governed path This pattern provides full auditability, threat protection, and policy enforcement. 9. AKS multitenancy options Model When to use Characteristics Namespace per tenant Default Cost‑efficient, logical isolation Dedicated node pools Medium isolation Reduced noisy‑neighbor risk Dedicated AKS cluster High compliance Maximum isolation, higher cost Enterprises typically adopt a tiered approach, choosing the isolation level per tenant based on regulatory and risk requirements. 10. Cost management & chargeback model Tagging strategy (mandatory) tenantId costCenter application environment owner Enforced via Azure Policy across all subscriptions. Chargeback approach Dedicated spoke resources: Direct attribution via subscription and tags Shared hub resources: Allocated using usage telemetry o API calls and token usage from API Management o CPU/memory usage from AKS namespaces Cost data is exported to Azure Cost Management and visualized using Power BI to support showback and chargeback. 11. Security controls checklist Private endpoints for AI services, storage, and search No public network access for sensitive services Azure Firewall for centralized egress and inspection WAF for OWASP protection Azure Policy for governance and compliance 12. Deployment & automation Foundation: Azure Landing Zone accelerators (Bicep or Terraform) Workloads: Modular IaC for hub and spokes AKS apps: GitOps (Flux or Argo CD) Observability: Policy‑driven diagnostics and centralized logging 13. Final thoughts This Azure AI Landing Zone design provides a repeatable, secure, and enterprise‑ready foundation for any large customer adopting AI at scale. By combining: Hub‑and‑Spoke networking AKS‑based AI agents Strong tenant isolation FinOps‑ready chargeback Azure Landing Zone best practices organizations can confidently move AI workloads from experimentation to production—without sacrificing security, governance, or cost transparency. Disclaimer: While the above article discusses hosting custom agents on AKS alongside customer-developed application logic, the following sections focus on a baseline deployment model with no customizations. This approach uses Azure AI Foundry, where models and agents are fully managed by Azure, with centrally governed LLMs(AI Hub) hosted in Azure AI Foundry and agents deployed in a spoke environment. 🚀 Get Started: Building a Secure & Scalable Azure AI Platform To help you accelerate your Azure AI journey, Microsoft and the community provide several reference architectures, solution accelerators, and best-practice guides. Together, these form a strong foundation for designing secure, governed, and cost-efficient GenAI and AI workloads at scale. Below is a recommended starting path. 1️⃣ AI Landing Zone (Foundation) Purpose: Establish a secure, enterprise-ready foundation for AI workloads. The AI Landing Zone extends the standard Azure Landing Zone with AI-specific considerations such as: Network isolation and hub-spoke design Identity and access control for AI services Secure connectivity to data sources Alignment with enterprise governance and compliance 🔗 AI Landing Zone (GitHub): https://github.com/Azure/AI-Landing-Zones?tab=readme-ov-file 👉 Start here if you want a standardized baseline before onboarding any AI workloads. 2️⃣ AI Hub Gateway – Solution Accelerator Purpose: Centralize and control access to AI services across multiple teams or customers. The AI Hub Gateway Solution Accelerator helps you: Expose AI capabilities (models, agents, APIs) via a centralized gateway Apply consistent security, routing, and traffic controls Support both Chat UI and API-based consumption Enable multi-team or multi-tenant AI usage patterns 🔗 AI Hub Gateway Solution Accelerator: https://github.com/mohamedsaif/ai-hub-gateway-landing-zone?tab=readme-ov-file 👉 Ideal when you want a shared AI platform with controlled access and visibility. 3️⃣ Citadel Governance Hub (Advanced Governance) Purpose: Enforce strong governance, compliance, and guardrails for AI usage. The Citadel Governance Hub builds on top of the AI Hub Gateway and focuses on: Policy enforcement for AI usage Centralized governance controls Secure onboarding of teams and workloads Alignment with enterprise risk and compliance requirements 🔗 Citadel Governance Hub (README): https://github.com/Azure-Samples/ai-hub-gateway-solution-accelerator/blob/citadel-v1/README.md 👉 Recommended for regulated environments or large enterprises with strict governance needs. 4️⃣ AKS Cost Analysis (Operational Excellence) Purpose: Understand and optimize the cost of running AI workloads on AKS. AI platforms often rely on AKS for agents, inference services, and gateways. This guide explains: How AKS costs are calculated How to analyze node, pod, and workload costs Techniques to optimize cluster spend 🔗 AKS Cost Analysis: https://learn.microsoft.com/en-us/azure/aks/cost-analysis 👉 Use this early to avoid unexpected cost overruns as AI usage scales. 5️⃣ AKS Multi-Tenancy & Cluster Isolation Purpose: Safely run workloads for multiple teams or customers on AKS. This guidance covers: Namespace vs cluster isolation strategies Security and blast-radius considerations When to use shared clusters vs dedicated clusters Best practices for multi-tenant AKS platforms 🔗 AKS Multi-Tenancy & Cluster Isolation: https://learn.microsoft.com/en-us/azure/aks/operator-best-practices-cluster-isolation 👉 Critical reading if your AI platform supports multiple teams, business units, or customers. 🧭 Suggested Learning Path If you’re new, follow this order: AI Landing Zone → build the foundation AI Hub Gateway → centralize AI access Citadel Governance Hub → enforce guardrails AKS Cost Analysis → control spend AKS Multi-Tenancy → scale securely
VimalVerma
Feb 03, 2026 Place Azure Architecture Blog
1.3KViews
1like
0Comments
From Large Semi-Structured Docs to Actionable Data: In-Depth Evaluation Approaches Guidance
Introduction Extracting structured data from large, semi-structured documents (the detailed solution implementation overview and architecture is provided in this tech community blog: From Large Semi-Structured Docs to Actionable Data: Reusable Pipelines with ADI, AI Search & OpenAI) demands a rigorous evaluation framework. The goal is to ensure our pipeline is accurate, reliable, and scalable before we trust it with mission-critical data. This framework breaks evaluation into clear phases, from how we prepare the document, to how we find relevant parts, to how we validate the final output. It provides metrics, examples, and best practices at each step, forming a generic pattern that can be applied to various domains. Framework Overview A very structured and stepwise approach for evaluation is given below: Establish Ground Truth & Sampling: Define a robust ground truth set and sampling method to fairly evaluate all parts of the document. Preprocessing Evaluation: Verify that OCR, chunking, and any structural augmentation (like adding headers) preserve all content and context. Labelling Evaluation: Check classification of sections/chunks by content based on topic/entity and ensure irrelevant data is filtered out without losing any important context. Retrieval Evaluation: Ensure the system can retrieve the right pieces of information (using search) with high precision@k and recall@k. Extraction Accuracy Evaluation: Measure how well the final structured data matches the expected values (field accuracy, record accuracy, overall precision/recall). Continuous Improvement Loop with SME: Use findings to retrain, tweak, and improve, enabling the framework to be reused for new documents and iterations. SMEs play a huge role in such scenarios. Detailed Guidance on Evaluation Below is a step-by-step, in-depth guide to evaluating this kind of IDP (Indelligent Document Processing) pipeline, covering both the overall system and its individual components: Establish Ground Truth & Sampling Why: Any evaluation is only as good as the ground truth it’s compared against. Start by assembling a reliable “source of truth” dataset for your documents. This often means manual labelling of some documents by domain experts (e.g., a legal team annotating key clauses in a contract, or accountants verifying invoice fields). Because manual curation is expensive, be strategic in what and how we sample. Ground Truth Preparation: Identify the critical fields and sections we need to extract, and create an annotated set of documents with those values marked correct. For example, if processing financial statements, we might mark the ground truth values for Total Assets, Net Income, Key Ratios, etc. This ground truth should be the baseline to measure accuracy against. Although creating it is labour-intensive, it yields a precise benchmark for model performance. Stratified Sampling: Documents like contracts or policies have diverse sections. To evaluate thoroughly, use stratified sampling – ensure your test set covers all major content types and difficulty levels. For instance, if 15% of pages in a set of contracts are annexes or addendums, then ~15% of your evaluation pages should come from annexes, not just the main body. This prevents the evaluation from overlooking challenging or rare sections. In practice, we might partition a document by section type (clauses, tables, schedules, footnotes) and sample a proportion from each. This way, metrics reflect performance on each type of content, not just the easiest portions. Multi-Voter Agreement (Consensus): It’s often helpful to have multiple automated voters on the outputs before involving humans. For example, suppose we extracted an invoice amount; we can have: A regex/format checker/fuzzy matching voter A cross-field logic checker/embedding based matching voter An ML model confidence score/LLM as a judge vote If all signals are strong, we label that extraction as Low Risk; if they conflict, mark it High Risk for human review. By tallying such “votes”, we create tiers of confidence. Why? Because in many cases, a large portion of outputs will be obviously correct (e.g., over 80% might have unanimous high confidence), and we can safely assume those are right, focusing manual review on the remainder. This strategy effectively reduces the human workload while maintaining quality. Preprocessing Evaluation Before extracting meaning, make sure the raw text and structure are captured correctly. Any loss here breaks the whole pipeline. Key evaluation checks: OCR / Text Extraction Accuracy Character/Error Rate: Sample pages to see how many words are recognized correctly (use per-word confidence to spot issues). Layout Preservation: Ensure reading order isn’t scrambled, especially in multi-column pages or footnotes. Content Coverage: Verify every sentence from a sample page appears in the extracted text. Missing footers or sidebars count as gaps. Chunking Completeness: Combined chunks should reconstruct the full document. Word counts should match. Segment Integrity: Chunks should align to natural boundaries (paragraphs, tables). Track a metric like “95% clean boundaries.” Context Preservation: If a table or section spans chunks, mark relationships so downstream logic sees them as connected. Multi-page Table Handling Header Insertion Accuracy: Validate that continued pages get the correct header (aim for high 90% to maintain context across documents). No False Headers: Ensure new tables aren’t mistakenly treated as continuations. Track a False Continuation Rate and push it to near zero. Practical Check: Sample multi-page tables across docs to confirm consistent extraction and no missed rows. Structural Links / References Link Accuracy: Confirm references (like footnotes or section anchors) map to the right targets (e.g., 98%+ correct). Ontology / Category Coverage: If content is pre-grouped, check precision (no mis-grouping) and recall (nothing left uncategorized). Implication The goal is to ensure the pre-processed chunks are a faithful, complete, and structurally coherent representation of the original document. Metrics like content coverage, boundary cleanliness, and header accuracy help catch issues early. Fixing them here saves significant downstream debugging. Labelling Evaluation – “Did we isolate the right pieces?” Once we chunk the document, we label those chunks (with ML or rules) to map them to the right entities and throw out the noise. Think of this step as sorting useful clauses from filler. Section/Entity Labelling Accuracy Treat labelling as a multi-class or multi-label classification problem. Precision (Label Accuracy): Of the chunks we labelled as X, how many were actually X? Example: Model tags 40 chunks as “Financial Data.” If 5 are wrong, precision is 87.5. High precision avoids polluting a category (topic/entity) with junk. Recall (Coverage): Of the chunks that truly belong to category X, how many did we catch? Example: Ground truth has 50 Financial Data chunks, model finds 45. Recall is 90%. High recall prevents missing important sections. Example: A model labels paper sections as Introduction, Methods, Results, etc. It marks 100 sections as Results and 95 are correct (95% precision). It misses 5 actual Results (slightly lower recall). That’s acceptable if downstream steps can still recover some items. But low precision means the labelling logic needs tightening. Implication Low precision means wrong info contaminates the category. Low recall means missing crucial bits. Use these metrics to refine definitions or adjust the labelling logic. Don’t just report one accuracy number; precision and recall per label tell the real story. Retrieval Evaluation – “Can we find the right info when we ask?” Many document pipelines use retrieval to narrow a huge file down to the few chunks most likely to contain the answer corresponding to a topic/entity. If we need a “termination date,” we first fetch chunks about dates or termination, then extract from those. Retrieval must be sharp, or everything downstream suffers. Precision@K How many of the top K retrieved chunks are actually relevant? If we grab 5 chunks for “Key Clauses” and 4 are correct, Precision@5 is 80%. We usually set K to whatever the next stage consumes (3 or 5). High precision keeps extraction clean. Average it across queries or fields. Critical fields may demand very high Precision@K. Recall@K Did we retrieve enough of the relevant chunks? If there are 2 relevant chunks in the doc but the top 5 results include only 1, recall is 50%. Good recall means we aren’t missing mentions in other sections or appendices. Increasing K improves recall but can dilute precision. Tune both together. Ranking Quality (MRR, NDCG) If order matters, use rank-aware metrics. MRR: Measures how early the first relevant result appears. Perfect if it’s always at rank 1. NDCG@K: Rewards having the most relevant chunks at the top. Useful when relevance isn’t binary. Most pipelines can get away with Precision@K and maybe MRR. Implication Test 50 QA pairs from policy documents, retrieving 3 passages per query. Average Precision@3: 85%. Average Recall@3: 92%. MRR: 0.8. Suppose, we notice “data retention” answers appear in appendices that sometimes rank low. We increase K to 5 for that query type. Precision@3 rises to 90%, and Recall@5 hits roughly 99%. Retrieval evaluation is a sanity check. If retrieval fails, extraction recall will tank no matter how good the extractor is. Measure both so we know where the leak is. Also keep an eye on latency and cost if fancy re-rankers slow things down. Extraction Accuracy Evaluation – “Did we get the right answers?” Look at each field and measure how often we got the right value. Precision: Of the values we extracted, what percent are correct? Use exact match or a lenient version if small format shifts don’t matter. Report both when useful. Recall: Out of all ground truth values, how many did we actually extract? Per-field breakdown: Some fields will be easy (invoice numbers, dates), others messy (vendor names, free text). A simple table makes this obvious and shows where to focus improvements. Error Analysis Numbers don’t tell the whole story. Look at patterns: OCR mix-ups Bad date or amount formats Wrong chunk retrieved upstream Misread tables Find the recurring mistakes. That’s where the fixes live. Holistic Metrics If needed, compute overall precision/recall across all extracted fields. But per-field and record-level are usually what matter to stakeholders. Implication Precision protects against wrong entries. Recall protects against missing data. Choose your balance based on risk: If false positives hurt more (wrong financial numbers), favour precision. If missing items hurts more (missing red-flag clauses), favour recall. Continuous Improvement Loop with SME Continuous improvement means treating evaluation as an ongoing feedback loop rather than a one-time check. Each phase’s errors point to concrete fixes, and every fix is re-measured to ensure accuracy moves in the right direction without breaking other components. The same framework also supports A/B testing alternative methods and monitoring real production data to detect drift or new document patterns. Because the evaluation stages are modular, they generalize well across domains such as contracts, financial documents, healthcare forms, or academic papers with only domain-specific tweaks. Over time, this creates a stable, scalable and measurable path toward higher accuracy, better robustness, and easier adaptation to new document types. Conclusion Building an end-to-end evaluation framework isn’t just about measuring accuracy, it’s about creating trust in the entire pipeline. By breaking the process into clear phases, defining robust ground truth, and applying precision/recall-driven metrics at every stage, we ensure that document processing systems are reliable, scalable, and adaptable. This structured approach not only highlights where improvements are needed but also enables continuous refinement through SME feedback and iterative testing. Ultimately, such a framework transforms evaluation from a one-time exercise into a sustainable practice, paving the way for higher-quality outputs across diverse domains.
anishganguli
Dec 29, 2025 Place Azure Architecture Blog
366Views
2likes
1Comment
Azure OpenAI Landing Zone reference architecture
In this article, delve into the synergy of Azure Landing Zones and Azure OpenAI Service, building a secure and scalable AI environment. Unpack the Azure OpenAI Landing Zone architecture, which integrates numerous Azure services for optimal AI workloads. Explore robust security measures and the significance of monitoring for operational success. This journey of deploying Azure OpenAI evolves alongside Azure's continual innovation.
FreddyAyala
Dec 22, 2025 Place Azure Architecture Blog
213KViews
44likes
21Comments
From Large Semi-Structured Docs to Actionable Data: Reusable Pipelines with ADI, AI Search & OpenAI
Problem Space Large semi-structured documents such as contracts, invoices, hospital tariff/rate cards multi-page reports, and compliance records often carry essential information that is difficult to extract reliably with traditional approaches. Their layout can span several pages, the structure is rarely consistent, and related fields may appear far apart even though they must be interpreted together. This makes it hard not only to detect the right pieces of information but also to understand how those pieces relate across the document. LLM can help, but when documents are long and contain complex cross-references, they may still miss subtle dependencies or generate hallucinated information. That becomes risky in environments where small errors can cascade into incorrect decisions. At the same time, these documents don’t change frequently, while the extracted data is used repeatedly by multiple downstream systems at scale. Because of this usage pattern, a RAG-style pipeline is often not ideal in terms of cost, latency, or consistency. Instead, organizations need a way to extract data once, represent it consistently, and serve it efficiently in a structured form to a wide range of applications, many of which are not conversational AI systems. At this point, data stewardship becomes critical, because once information is extracted, it must remain accurate, governed, traceable, and consistent throughout its lifecycle. When the extracted information feed compliance checks, financial workflows, risk models, or end-user experiences, the organization must ensure that the data is not just captured correctly but also maintained with proper oversight as it moves across systems. Any extraction pipeline that cannot guarantee quality, reliability, and provenance introduces long-term operational risk. The core problem, therefore, is finding a method that handles the structural and relational complexity of large semi-structured documents, minimizes LLM hallucination risk, produces deterministic results, and supports ongoing data stewardship so that the resulting structured output stays trustworthy and usable across the enterprise. Target Use Cases The potential applications for an Intelligent Document Processing (IDP) pipeline differ across industries. Several industry-specific use cases are provided as examples to guide the audience in conceptualising and implementing solutions tailored to their unique requirements. Hospital Tariff Digitization for Tariff-Invoice Reconciliation in Health Insurance Document types: Hospital tariff/rate cards, annexures/guidelines, pre-authorization guidelines etc. Technical challenge: Charges for the same service might appear under different sections or for different hospital room types across different versions of tariff/rate cards. Table + free text mix, abbreviations, and cross-page references. Downstream usage: Reimbursement orchestration, claims adjudication Commercial Loan Underwriting in Banking Document types: Balance sheets, cash-flow statements, auditor reports, collateral documents. Technical Challenge: Ratios and covenants must be computed from fields located across pages. Contextual dependencies: “Net revenue excluding exceptional items” or footnotes that override values. Downstream usage: Loan decisioning models, covenant monitoring, credit scoring. Procurement Contract Intelligence in Manufacturing Document types: Vendor agreements, SLAs, pricing annexures. Technical Challenge: Pricing rules defined across clauses that reference each other. Penalty and escalation conditions hidden inside nested sections. Downstream usage: Automated PO creation, compliance checks. Regulatory Compliance Extraction Document types: GDPR/HIPAA compliance docs, audit reports. Technical Challenge: Requirements and exceptions buried across many sections. Extraction must be deterministic since compliance logic is strict. Downstream usage: Rule engines, audit workflows, compliance checklist. Solution Approaches Problem Statement Across industries from finance and healthcare to legal and compliance, large semi-structured documents serve as the single source of truth for critical workflows. These documents often span hundreds of pages, mixing free text, tables, and nested references. Before any automation can validate transactions, enforce compliance, or perform analytics, this information must be transformed into a structured, machine-readable format. The challenge isn’t just size; it’s complexity. Rules and exceptions are scattered, relationships span multiple sections, and formatting inconsistencies make naive parsing unreliable. Errors at this stage ripple downstream, impacting reconciliation, risk models, and decision-making. In short, the fidelity of this digitization step determines the integrity of every subsequent process. Solving this problem requires a pipeline that can handle structural diversity, preserve context, and deliver deterministic outputs at scale. Challenges There are many challenges which can arise while solving for such large complex documents. The documents can have ~200-250 pages. The documents structures and layouts can be extremely complex in nature. A document or a page may contain a mix of various layouts like tables, text blocks, figures etc. Sometimes a single table can stretch across multiple pages, but only the first page contains the table header, leaving the remaining pages without column labels. A topic on one page may be referenced from a different page, so there can be complex inter-relationship amongst different topics in the same documents which needs to be structured in a machine-readable format. The document can be semi-structured as well (some parts are structured; some parts are unstructured or free text) The downstream applications might not always be AI-assisted (it can be core analytics dashboard or existing enterprise legacy system), so the structural storage of the digitized items from the documents need to be very well thought out before moving ahead with the solution. Motivation Behind High Level Approach A larger document (number of pages ~200) needs to be divided into smaller chunks so that it becomes readable and digestible (within context length) for the LLM. To make the content/input of the LLM truly context-aware, the references must be maintained across pages (for example, table headers of long and continuous tables need to be injected to those chunks which would have the tables without the headers). If a pre-defined set of topics/entities are being covered in the documents in consideration, then topic/entity-wise information needs to be extracted for making the system truly context-aware. Different chunks can cover similar topic/entity which becomes a search problem The retrieval needs to happen for every topic/entity so that all information related to one topic/entity are in a single place and as a result the downstream applications become efficient, scalable and reliable over time. Sample Architecture and Implementation Let’s take a possible approach to demonstrate the feasibility of the following architecture, building on the motivation outlined above. The solution divides a large, semi-structured document into manageable chunks, making it easier to maintain context and references across pages. First, the document is split into logical sections. Then, OCR and layout extraction capture both text and structure, followed by structure analysis to preserve semantic relationships. Annotated chunks are labeled and grouped by entity, enabling precise extraction of items such as key-value pairs or table data. As a result, the system efficiently transforms complex documents into structured, context-rich outputs ready for downstream analytics and automation. Architecture Components The key elements of the architecture diagram include components 1-6, which are code modules. Components 7 and 8 represent databases that store data chunks and extracted items, while component 9 refers to potential downstream systems that will use the structured data obtained from extraction. Chunking: Break documents into smaller, logical sections such as pages or content blocks. Enables parallel processing and improves context handling for large files. Technology: Python-based chunking logic using pdf2image and PIL for image handling. OCR & Layout Extraction: Convert scanned images into machine-readable text while capturing layout details like bounding boxes, tables, and reading order for structural integrity. Technology: Azure Document Intelligence or Microsoft Foundry Content Understanding Prebuilt Layout model combining OCR with deep learning for text, tables, and structure extraction. Context Aware Structural Analysis: Analyse the extracted layout to identify document components such as headers, paragraphs, and tables. Preserves semantic relationships for accurate interpretation. Technology: Custom Python logic leveraging OCR output to inject missing headers, summarize layout (row/column counts, sections per page). Labelling: Assign entity-based labels to chunks according to predefined schema or SME input. Helps filter irrelevant content and focus on meaningful sections. Technology: Azure OpenAI GPT-4.1-mini with NLI-style prompts for multi-class classification. Entity-Wise Grouping: Organize chunks by entity type (e.g., invoice number, total amount) for targeted extraction. Reduces noise and improves precision in downstream tasks. Technology: Azure AI Search with Hybrid Search and Semantic Reranking for grouping relevant chunks. Item Extraction: Extract specific values such as key-value pairs, line items, or table data from grouped chunks. Converts semi-structured content into structured fields. Technology: Azure OpenAI GPT-4.1-mini with Set of Marking style prompts using layout clues (row × column, headers, OCR text). Interim Chunk Storage: Store chunk-level data including OCR text, layout metadata, labels, and embeddings. Supports traceability, semantic search, and audit requirements. Technology: Azure AI Search for chunk indexing and Azure OpenAI Embedding models for semantic retrieval. Document Store: Maintain final extracted items with metadata and bounding boxes. Enables quick retrieval, validation, and integration with enterprise systems. Technology: Azure Cosmos DB, Azure SQL DB, Azure AI Search, or Microsoft Fabric depending on downstream needs (analytics, APIs, LLM apps). Downstream Integration: Deliver structured outputs (JSON, CSV, or database records) to business applications or APIs. Facilitates automation and analytics across workflows. Technology: REST APIs, Azure Functions, or Data Pipelines integrated with enterprise systems. Algorithms Consider these key algorithms when implementing the components above: Structural Analysis – Inject headers: Detect tables page by page; compare the last row of a table on page i with the first row of a table on page i+1, if column counts match and ≥4/5 style features (Font Weight, Background Colour, Font Style, Foreground Colour, Similar Font Family) match, mark it as a continuous table (header missing) and inject the previous page’s header into the next page’s table, repeating across pages. Labelling – Prompting Guide: Run NLI checks per SOC chunk image (ground on OCR text) across N curated entity labels, return {decision ∈ {ENTAILED, CONTRADICTED, NEUTRAL}, confidence ∈ [0,1]}, and output only labels where decision = ENTAILED and confidence > 0.7. Entity-Wise Grouping – Querying Chunks per Entity & Top‑50 Handling: Construct the query from the entity text and apply hybrid search with label filters for Azure AI Search, starting with chunks where the target label is sole, then expanding to observed co‑occurrence combinations under a cap to prevent explosion. If label frequency >50, run staged queries (sole‑label → capped co‑label combos); otherwise use a single hybrid search with semantic reranking, merge results and deduplicate before scoring. Entity-Wise Grouping – Chunk to Entity relevance scoring: For each retrieved chunk, split text into spans; compute cosine similarities to the entity text and take the mean s. Boost with a gated nonlinearity b=σ(k(s-m))⋅s. where σ is sigmoid function and k,m are tunables to emphasize mid-range relevance while suppressing very low s. Min–max normalize the re-ranker score r → r_norm; compute the final score F=α*b+(1-α)*r_norm, and keep the chunk iff F≥τ. Item Extraction – Prompting Guide: Provide the chunk image as an input and ground on visual structure (tables, headers, gridlines, alignment, typography) and document structural metadata to segment and align units; reconcile ambiguities via OCR extracted text, then enumerate associations by positional mapping (header ↔ column, row ↔ cell proximity) and emit normalized objects while filtering narrative/policy text by layout and pattern cues. Deployment at Scale There are several ways to implement a document extraction pipeline, each with its own pros and cons. The best deployment model depends on scenario requirements. Below are some common approaches with their advantages and disadvantages. Host as REST API Pros: Enables straightforward creation, customization, and deployment across scalable compute services such as Azure Kubernetes Service. Cons: Processing time and memory usage scale with document size and complexity, potentially requiring multiple iterations to optimize performance. Deploy as Azure Machine Learning (ML) Pipeline Pros: Facilitates efficient time and memory management, as Azure ML supports processing large datasets at scale. Cons: The pipeline may be more challenging to develop, customize, and maintain. Deploy as Azure Databricks Job Pros: Offers robust time and memory management similar to Azure ML, with advanced features such as Data Autoloader for detecting data changes and triggering pipeline execution. Cons: The solution is highly tailored to Azure Databricks and may have limited customization options. Deploy as Microsoft Fabric Pipeline Pros: Provides capabilities comparable to Azure ML and Databricks, and features like Fabric Activator replicate Databricks Autoloader functionality. Cons: Presents similar limitations found in Azure ML and Azure Databricks approaches. Each method should be carefully evaluated to ensure alignment with technical and operational requirements. Evaluation Objective: The aim is to evaluate how accurately a document extraction pipeline extracts information by comparing its output with manually verified data. Approach: Documents are split into sections, labelled, and linked to relevant entities; then, AI tools extract key items through the outlined pipeline mentioned above. The extracted data is checked against expert-curated records using both exact and approximate matching techniques. Key Metrics: Individual Item Attribute Match: Assesses the system’s ability to identify specific item attributes using strict and flexible comparison methods. Combined Item Attribute Match: Evaluates how well multiple attributes are identified together, considering both exact and fuzzy matches. Precision Calculation: Precision for each metric reflects the proportion of correctly matched entries compared to all reference entries. Findings for a real-world scenario: Fuzzy matching of item key attributes yields high precision (over 90%), but accuracy drops for key attribute combinations (between 43% and 48%). These results come from analysis across several datasets to ensure reliability. How This Addresses the Problem Statement The sample architecture described integrates sectioning, entity linking, and attribute extraction as foundational steps. Each extracted item is then evaluated against expert-curated datasets using both strict (exact) and flexible (fuzzy) matching algorithms. This approach directly addresses the problem statement by providing measurable metrics, such as individual and combined attribute match rates and precision calculations, that quantify the system’s reliability and highlight areas for improvement. Ultimately, this methodology ensures that the pipeline’s output is systematically validated, and its strengths and limitations are clearly understood in real-world contexts. Plausible Alternative Approaches No single approach fits every use case; the best method depends on factors like document complexity, structure, sensitivity, and length as well as the downstream application types, Consider these alternative approaches for different scenarios. Using Azure OpenAI alone Article: Best Practices for Structured Extraction from Documents Using Azure OpenAI Using Azure OpenAI + Azure Document Intelligence + Azure AI Search: RAG like solution Article 1: Document Field Extraction with Generative AI Article 2: Complex Data Extraction using Document Intelligence and RAG Article 3: Design and develop a RAG solution Using Azure OpenAI + Azure Document Intelligence + Azure AI Search: Non-RAG like solution Article: Using Azure AI Document Intelligence and Azure OpenAI to extract structured data from documents GitHub Repository: Content processing solution accelerator Conclusion Intelligent Document Processing for large semi-structured documents isn’t just about extracting data, it’s about building trust in that data. By combining Azure Document Intelligence for layout-aware OCR with OpenAI models for contextual understanding, we create a well thought out in-depth pipeline that is accurate, scalable, and resilient against complexity. Chunking strategies ensure context fits within model limits, while header injection and structural analysis preserve relationships across pages to make it context-aware. Entity-based grouping and semantic retrieval transform scattered content into organized, query-ready data. Finally, rigorous evaluation with scalable ground truth strategy roadmap, using precision, recall, and fuzzy matching, closes the loop, ensuring reliability for downstream systems. This pattern delivers more than automation; it establishes a foundation for compliance, analytics, and AI-driven workflows at enterprise scale. In short, it’s a blueprint for turning chaotic document into structured intelligence, efficient, governed, and future-ready for any kind of downstream applications. End-to-End Evaluation Approaches Guidance Given the complexity of this system, it should undergo a thorough end-to-end evaluation to ensure correctness, robustness, and performance across the pipeline. Continuous monitoring and observability of these metrics will enable iterative improvements and help the system scale reliably as requirements evolve. If you would like to read more about the end-to-end evaluation approaches guidance, please refer to our tech community blog: From Large Semi-Structured Docs to Actionable Data: In-Depth Evaluation Approaches Guidance. References Azure Content Understanding in Foundry Tools Azure Document Intelligence in Foundry Tools Azure OpenAI in Microsoft Foundry models Azure AI Search Azure Machine Learning (ML) Pipelines Azure Databricks Job Microsoft Fabric Pipeline
anishganguli
Dec 18, 2025 Place Azure Architecture Blog
660Views
0likes
0Comments
Streamline Azure NetApp Files Management—Right from Your IDE
The Azure NetApp Files VS Code Extension is designed to streamline storage provisioning and management directly within the developer’s IDE. Traditional workflows often require extensive portal navigation, manual configuration, and policy management, leading to inefficiencies and context switching. The extension addresses these challenges by enabling AI-powered automation through natural language commands, reducing provisioning time from hours to minutes while minimizing errors and improving compliance. Key capabilities include generating production-ready ARM templates, validating resources, and delivering optimization insights—all without leaving the coding environment.
GeertVanTeylingen
Dec 15, 2025 Place Azure Architecture Blog
230Views
0likes
0Comments
How Azure NetApp Files Object REST API powers Azure and ISV Data and AI services – on YOUR data
This article introduces the Azure NetApp Files Object REST API, a transformative solution for enterprises seeking seamless, real-time integration between their data and Azure's advanced analytics and AI services. By enabling direct, secure access to enterprise data—without costly transfers or duplication—the Object REST API accelerates innovation, streamlines workflows, and enhances operational efficiency. With S3-compatible object storage support, it empowers organizations to make faster, data-driven decisions while maintaining compliance and data security. Discover how this new capability unlocks business potential and drives a new era of productivity in the cloud.
GeertVanTeylingen
Dec 15, 2025 Place Azure Architecture Blog
1KViews
0likes
0Comments
Accelerating HPC and EDA with Powerful Azure NetApp Files Enhancements
High-Performance Computing (HPC) and Electronic Design Automation (EDA) workloads demand uncompromising performance, scalability, and resilience. Whether you're managing petabyte-scale datasets or running compute intensive simulations, Azure NetApp Files delivers the agility and reliability needed to innovate without limits.
GeertVanTeylingen
Dec 15, 2025 Place Azure Architecture Blog
747Views
1like
0Comments
How to Modernise a Microsoft Access Database (Forms + VBA) to Node.JS, OpenAPI and SQL Server
Microsoft Access has played a significant role in enterprise environments for over three decades. Released in November 1992, its flexibility and ease of use made it a popular choice for organizations of all sizes—from FTSE250 companies to startups and the public sector. The platform enables rapid development of graphical user interfaces (GUIs) paired with relational databases, allowing users to quickly create professional-looking applications. Developers, data architects, and power users have all leveraged Microsoft Access to address various enterprise challenges. Its integration with Microsoft Visual Basic for Applications (VBA), an object-based programming language, ensured that Access solutions often became central to business operations. Unsurprisingly, modernizing these applications is a common requirement in contemporary IT engagements as thse solutions lead to data fragmentation, lack of integration into master data systems, multiple copies of the same data replicated across each access database and so on. At first glance, upgrading a Microsoft Access application may seem simple, given its reliance on forms, VBA code, queries, and tables. However, substantial complexity often lurks beneath this straightforward exterior. Modernization efforts must consider whether to retain the familiar user interface to reduce staff retraining, how to accurately re-implement business logic, strategies for seamless data migration, and whether to introduce an API layer for data access. These factors can significantly increase the scope and effort required to deliver a modern equivalent, especially when dealing with numerous web forms, making manual rewrites a daunting task. This is where GitHub Copilot can have a transformative impact, dramatically reducing redevelopment time. By following a defined migration path, it is possible to deliver a modernized solution in as little as two weeks. In this blog post, I’ll walk you through each tier of the application and give you example prompts used at each stage. 🏛️Architecture Breakdown: The N-Tier Approach Breaking down the application architecture reveals a classic N-Tier structure, consisting of a presentation layer, business logic layer, data access layer, and data management layer. 💫First-Layer Migration: Migrating a Microsoft Access Database to SQL Server The migration process began with the database layer, which is typically the most straightforward to move from Access to another relational database management system (RDBMS). In this case, SQL Server was selected to leverage the SQL Server Migration Assistant (SSMA) for Microsoft Access—a free tool from Microsoft that streamlines database migration to SQL Server, Azure SQL Database, or Azure SQL Database Managed Instance (SQLMI). While GitHub Copilot could generate new database schemas and insert scripts, the availability of a specialized tool made the process more efficient. Using SSMA, the database was migrated to SQL Server with minimal effort. However, it is important to note that relationships in Microsoft Access may lack explicit names. In such cases, SSMA appends a GUID or uses one entirely to create unique foreign key names, which can result in confusing relationship names post-migration. Fortunately, GitHub Copilot can batch-rename these relationships in the generated SQL scripts, applying more meaningful naming conventions. By dropping and recreating the constraints, relationships become easier to understand and maintain. SSMA handles the bulk of the migration workload, allowing you to quickly obtain a fully functional SQL Server database containing all original data. In practice, renaming and recreating constraints often takes longer than the data migration itself. Prompt Used: # Context I want to refactor the #file:script.sql SQL script. Your task is to follow the below steps to analyse it and refactor it according to the specified rules. You are allowed to create / run any python scripts or terminal commands to assist in the analysis and refactoring process. # Analysis Phase Identify: Any warning comments Relations between tables Foreign key creation References to these foreign keys in 'MS_SSMA_SOURCE' metadata # Refactor Phase Refactor any SQL matching the following rules: - Create a new script file with the same name as the original but with a `.refactored.sql` extension - Rename any primary key constraints to follow the format PK_{table_name}_{column_name} - Rename any foreign key constraints like [TableName]${GUID} to FK_{child_table}_{parent_table} - Rename any indexes like [TableName]${GUID} to IDX_{table_name}_{column_name} - Ensure any updated foreign keys are updated elsewhere in the script - Identify which warnings flagged by the migration assistant need addressed # Summary Phase Create a summary file in markdown format with the following sections: - Summary of changes made - List of warnings addressed - List of foreign keys renamed - Any other relevant notes 🤖Bonus: Introduce Database Automation and Change Management As we now had a SQL database, we needed to consider how we would roll out changes to the database and we could introduce a formal tool to cater for this within the solution which was Liquibase. Prompt Used: # Context I want to refactor #file:db.changelog.xml. Your task is to follow the below steps to analyse it and refactor it according to the specified rules. You are allowed to create / run any python scripts or terminal commands to assist in the analysis and refactoring process. # Analysis Phase Analyse the generated changelog to identify the structure and content. Identify the tables, columns, data types, constraints, and relationships present in the database. Identify any default values, indexes, and foreign keys that need to be included in the changelog. Identify any vendor specific data types / fucntions that need to be converted to common Liquibase types. # Refactor Phase DO NOT modify the original #file:db.changelog.xml file in any way. Instead, create a new changelog file called `db.changelog-1-0.xml` to store the refactored changesets. The new file should follow the structure and conventions of Liquibase changelogs. You can fetch https://docs.liquibase.com/concepts/data-type-handling.html to get available Liquibase types and their mappings across RDBMS implementations. Copy the original changesets from the `db.changelog.xml` file into the new file Refactor the changesets according to the following rules: - The main changelog should only include child changelogs and not directly run migration operations - Child changelogs should follow the convention db.changelog-{version}.xml and start at 1-0 - Ensure data types are converted to common Liquibase data types. For example: - `nvarchar(max)` should be converted to `TEXT` - `datetime2` should be converted to `TIMESTAMP` - `bit` should be converted to `BOOLEAN` - Ensure any default values are retained but ensure that they are compatible with the liquibase data type for the column. - Use standard SQL functions like `CURRENT_TIMESTAMP` instead of vendor-specific functions. - Only use vendor specific data types or functions if they are necessary and cannot be converted to common Liquibase types. These must be documented in the changelog and summary. Ensure that the original changeset IDs are preserved for traceability. Ensure that the author of all changesets is "liquibase (generated)" # Validation Phase Validate the new changelog file against the original #file:db.changelog.xml to ensure that all changesets are correctly refactored and that the structure is maintained. Confirm no additional changesets are added that were not present in the original changelog. # Finalisation Phase Provide a summary of the changes made in the new changelog file. Document any vendor specific data types or functions that were used and why they could not be converted to common Liquibase types. Ensure the main changelog file (`db.changelog.xml`) is updated to include the new child changelog file (`db.changelog-1-0.xml`). 🤖Bonus: Synthetic Data Generation Since the legacy system lacked synthetic data for development or testing, GitHub Copilot was used to generate fake seed data. Care was taken to ensure all generated data was clearly fictional—using placeholders like ‘Fake Name’ and ‘Fake Town’—to avoid any confusion with real-world information. This step greatly improved the maintainability of the project, enabling developers to test features without handling sensitive or real data. 💫Second-Layer Migration: OpenAPI Specifications With data migration complete, the focus shifted to implementing an API-driven approach for data retrieval. Adopting modern standards, OpenAPI specifications were used to define new RESTful APIs for creating, reading, updating, and deleting data. Because these APIs mapped directly to underlying entities, GitHub Copilot efficiently generated the required endpoints and services in Node.js, utilizing a repository pattern. This approach not only provided robust APIs but also included comprehensive self-describing documentation, validation at the API boundary, automatic error handling, and safeguards against invalid data reaching business logic or database layers. 💫Third-Layer Migration: Business Logic The business logic, originally authored in VBA, was generally straightforward. GitHub Copilot translated this logic into its Node.js equivalent and created corresponding tests for each method. These tests were developed directly from the code, adding a layer of quality assurance that was absent in the original Access solution. The result was a set of domain services mirroring the functionality of their VBA predecessors, successfully completing the migration of the third layer. At this stage, the project had a new database, a fresh API tier, and updated business logic, all conforming to the latest organizational standards. The final major component was the user interface, an area where advances in GitHub Copilot’s capabilities became especially evident. 💫Fourth Layer: User Interface The modernization of the Access Forms user interface posed unique challenges. To minimize retraining requirements, the new system needed to retain as much of the original layout as possible, ensuring familiar placement of buttons, dropdowns, and other controls. At the same time, it was necessary to meet new accessibility standards and best practices. Some Access forms were complex, spanning multiple tabs and containing numerous controls. Manually describing each interface for redevelopment would have been time-consuming. Fortunately, newer versions of GitHub Copilot support image-based prompts, allowing screenshots of Access Forms to serve as context. Using these screenshots, Copilot generated Government Digital Service Views that closely mirrored the original application while incorporating required accessibility features, such as descriptive labels and field selectors. Although the automatically generated UI might not fully comply with all current accessibility standards, prompts referencing WCAG guidelines helped guide Copilot’s improvements. The generated interfaces provided a strong starting point for UX engineers to further refine accessibility and user experience to meet organizational requirements. 🤖Bonus: User Story Generation from the User Interface For organizations seeking a specification-driven development approach, GitHub Copilot can convert screenshots and business logic into user stories following the “As a … I want to … So that …” format. While not flawless, this capability is invaluable for systems lacking formal requirements, giving business analysts a foundation to build upon in future iterations. 🤖Bonus: Introducing MongoDB Towards the end of the modernization engagement, there was interest in demonstrating migration from SQL Server to MongoDB. GitHub Copilot can facilitate this migration, provided it is given adequate context. As with all NoSQL databases, the design should be based on application data access patterns—typically reading and writing related data together. Copilot’s ability to automate this process depends on a comprehensive understanding of the application’s data relationships and patterns. # Context The `<business_entity>` entity from the existing system needs to be added to the MongoDB schema. You have been provided with the following: - #file:documentation - System documentation to provide domain / business entity context - #file:db.changelog.xml - Liquibase changelog for SQL context - #file:mongo-erd.md - Contains the current Mongo schema Mermaid ERD. Create this if it does not exist. - #file:stories - Contains the user stories that will the system will be built around # Analysis Phase Analyse the available documentation and changelog to identify the structure, relationships, and business context of the `<business_entity>`. Identify: - All relevant data fields and attributes - Relationships with other entities - Any specific data types, constraints, or business rules Determine how this entity fits into the overall MongoDB schema: - Should it be a separate collection? - Should it be embedded in another document? - Should it be a reference to another collection for lookups or relationships? - Explore the benefit of denormalization for performance and business needs Consider the data access patterns and how this entity will be used in the application. # MongoDB Schema Design Using the analysis, suggest how the `<business_entity>` should be represented in MongoDB: - The name of the MongoDB collection that will represent this entity - List each field in the collection, its type, any constraints, and what it maps to in the original business context - For fields that are embedded, document the parent collection and how the fields are nested. Nested fields should follow the format `parentField->childField`. - For fields that are referenced, document the reference collection and how the lookup will be performed. - Provide any additional notes on indexing, performance considerations, or specific MongoDB features that should be used - Always use pascal case for collection names and camel case for field names # ERD Creation Create or update the Mermaid ERD in `mongo-erd.md` to include the results of your analysis. The ERD should reflect: - The new collection or embedded document structure - Any relationships with other collections/entities - The data types, constraints, and business rules that are relevant for MongoDB - Ensure the ERD is clear and follows best practices for MongoDB schema design Each entity in the ERD should have the following layout: **Entity Name**: The name of the MongoDB collection / schema **Fields**: A list of fields in the collection, including: - Field Name (in camel case) - Data Type (e.g., String, Number, Date, ObjectId) - Constraints (e.g. indexed, unique, not null, nullable) In this example, Liquibase was used as a changelog to supply the necessary context, detailing entities, columns, data types, and relationships. Based on this, Copilot could offer architectural recommendations for new document or collection types, including whether to embed documents or use separate collections with cache references for lookup data. Copilot can also generate an entity relationship diagram (ERD), allowing for review and validation before proceeding. From there, a new data access layer can be generated, configurable to switch between SQL Server and MongoDB as needed. While production environments typically standardize on a single database model, this demonstration showcased the speed and flexibility with which strategic architectural components can be introduced using GitHub Copilot. 👨‍💻Conclusion This modernization initiative demonstrated how strategic use of automation and best practices can transform legacy Microsoft Access solutions into scalable, maintainable architectures utilizing Node.js, SQL Server, MongoDB, and OpenAPI. By carefully planning each migration layer—from database and API specifications to business logic—the team preserved core functionality while introducing modern standards and enhanced capabilities. GitHub Copilot played a pivotal role, not only speeding up redevelopment but also improving code quality through automated documentation, test generation, and meaningful naming conventions. The result was a significant reduction in development time, with a robust, standards-compliant system delivered in just two weeks compared to an estimated six to eight months using traditional manual methods. This project serves as a blueprint for organizations seeking to modernize their Access-based applications, highlighting the efficiency gains and quality improvements that can be achieved by leveraging AI-powered tools and well-defined migration strategies. The approach ensures future scalability, easier maintenance, and alignment with contemporary enterprise requirements.
anthkernan
Dec 09, 2025 Place Azure Architecture Blog
800Views
1like
1Comment