architecture
7 Topics[Architecture Pattern] Scaling Sync-over-Async Edge Gateways by Bypassing Service Bus Sessions
Hi everyone, I wanted to share an architectural pattern and an open-source implementation we recently built to solve a major scaling bottleneck at the edge: bridging legacy synchronous HTTP clients to long-running asynchronous AI workers. The Problem: Stateful Bottlenecks at the Edge When dealing with slow AI generation tasks (e.g., 45+ seconds), standard REST APIs will drop the connection resulting in 504 Gateway Timeouts. The standard integration pattern here is Sync-over-Async. The Gateway accepts the HTTP request, drops a message onto Azure Service Bus, waits for the worker to reply, and maps the reply back to the open HTTP connection. However, the default approach is to use Service Bus Sessions for request-reply correlation. At scale, this introduces severe limitations: 1. Stateful Gateways: The Gateway pod must request an exclusive lock on the session. It becomes tightly coupled to that specific request. 2. Horizontal Elasticity is Broken: If a reply arrives, it must go to the specific pod holding the lock. Other idle pods cannot assist. 3. Hard Limits: A traffic spike easily exhausts the namespace concurrent session limits (especially on the Standard tier). The Solution: Stateless Filtered Topics To achieve true horizontal scale, the API Gateway layer must be 100% stateless. We bypassed Sessions entirely by pushing the routing logic down to the broker using a Filtered Topic Pattern. How it works: 1. The Gateway injects a CorrelationId property (e.g., Instance-A-Req-1) into the outbound request. 2. Instead of locking a session, the Gateway spins up a lightweight, dynamic subscription on a shared Reply Topic with a SQL Filter: CorrelationId = 'Instance-A-Req-1'. 3. The AI worker processes the task and drops the reply onto the shared topic with the same property. 4. The Azure Service Bus broker evaluates the SQL filter and pushes the message directly to the correct Gateway pod. No session locks. No implicit instance affinity. Complete horizontal scalability. If a pod crashes, its temporary subscription simply drops—preventing locked poison messages. Open Source Implementation Implementing dynamic Service Bus Administration clients and receiver lifecycles is complex, so I abstracted this pattern into a Spring Boot starter for the community. It handles all the dynamic subscription and routing logic under the hood, allowing developers to execute highly scalable Sync-over-Async flows with a single line of code returning a CompletableFuture. GitHub Repository: https://github.com/ShivamSaluja/sentinel-servicebus-starter Full Technical Write-up: https://dev.to/shivamsaluja/sync-over-async-bypassing-azure-service-bus-session-limits-for-ai-workloads-269d I would love to hear from other architects in this hub. Have you run into similar session exhaustion limits when building Edge API Gateways? Have you adopted similar stateless broker-side routing, or do you rely on sticky sessions at your load balancers?21Views0likes0CommentsDetecting ACI IP Drift and Auto-Updating Private DNS (A + PTR) with Event Grid + Azure Functions
Solution Author Aditya_AzureNinja , Chiragsharma30 Solution Version v1.0 TL;DR Azure Container Instances (ACI) container groups can be recreated/updated over time and may receive new private IPs, which can cause DNS mismatches if forward and reverse records aren’t updated. This post shares an event-driven pattern that detects ACI IP drift and automatically reconciles Private DNS A (forward) and PTR (reverse) records using Event Grid + Azure Functions. Key requirement: Event delivery is at-least-once, so the solution must be idempotent. Problem statement In hub-and-spoke environments using per-spoke Private DNS zones for isolation, ACI workloads created/updated/deleted over time can receive new private IPs. We need to ensure: Forward lookup: aci-name.<spoke-zone> (A record) → current ACI private IP Reverse lookup: IP → aci-name.<spoke-zone> (PTR record) Two constraints drive this design: Azure Private DNS auto-registration is VM-only and does not create PTR records, so ACI needs explicit A/PTR record management. Reverse DNS is scoped to the VNet (reverse zone must be linked to the querying VNet, otherwise reverse lookup returns NXDOMAIN). Design principle: This solution was designed with the following non‑negotiable engineering goals: Event‑driven DNS updates must be triggered directly from resource lifecycle events, not polling or scheduled jobs. Container creation, restart, and deletion are the only reliable sources of truth for IP changes in ACI. Idempotent Azure Event Grid delivers events with at‑least‑once semantics. The system must safely process duplicate events without creating conflicting DNS records or failing on retries. Stateless The automation must not rely on in‑memory or persisted state to determine correctness. DNS itself is treated as the baseline state, allowing functions to scale, restart, and replay events without drift or dependency on prior executions. Clear failure modes DNS reconciliation failures must be explicit and observable. If DNS updates fail, the function invocation must fail loudly so the issue is visible, alertable, and actionable—never silently ignored. Components Event Grid subscriptions (filtered to ACI container group lifecycle events) Azure Function App (Python) with System Assigned Managed Identity Private DNS forward zone (A records) Private DNS reverse zone (PTR records) Supporting infra (typical): Storage account (function artifacts / operational needs) Application Insights + Log Analytics (observability) Event-driven flow ACI container group is created/updated/deleted. Event Grid emits a lifecycle event (delivery can be repeated). Function is triggered and reads the current ACI private IP. Function reconciles DNS: Upsert A record to current IP Upsert PTR record to FQDN Remove stale PTR(s) for hostname/IP as needed Function logs reconciliation outcome (updated vs no-op). Architecture overview (INFRA) This follows the“Event-driven registration” approach: Event Grid → Azure Function that reconciles DNS on ACI lifecycle events. RBAC at a glance (Managed Identity) Role Scope Purpose Storage Blob Data Owner Function App deployment storage account Access function artifacts and operational blobs (required because shared key access is disabled). Reader Each ACI workload resource group Read container group state and determine the current private IP. Private DNS Zone Contributor Private DNS forward zone(s) Create, update, and delete A records for ACI hostnames. Private DNS Zone Contributor Private DNS reverse zone(s) Create, update, and clean up PTR records for ACI IPs. Monitoring Metrics Publisher (optional) Data Collection Rule (DCR) Upload structured IP‑drift events to Log Analytics via the ingestion API. --- --- Architecture overview (APP) Event‑Driven DNS Reconciliation for Azure Container Instances 1. Event contract: what the function receives Azure Event Grid delivers events using a consistent envelope (Event Grid schema). Each event includes, at a minimum: topic subject id eventType eventTime data dataVersion metadataVersion In Azure Functions, the Event Grid trigger binding is the recommended way to receive these events directly. Why the subject field matters The subject field typically contains the ARM resource ID path of the affected resource. This solution relies on subject to: verify that the event is for an ACI container group (Microsoft.ContainerInstance/containerGroups) extract: subscription ID resource group name container group name Using subject avoids dependence on publisher‑specific payload fields and keeps parsing fast, deterministic, and resilient. 2. Subscription design: filter hard, process little The solution follows a strict runbook pattern: subscribe only to ARM lifecycle events filter aggressively so only ACI container groups are included trigger reconciliation only on meaningful state transitions Recommended Event Grid event types Microsoft.Resources.ResourceWriteSuccess (create / update / stop state changes) Microsoft.Resources.ResourceDeleteSuccess (container group deletion) Microsoft.Resources.ResourceActionSuccess (optional) (restart / start / stop actions, environment‑dependent) This keeps the Function App simple, predictable, and low‑noise. 3. Application design: two functions, one contract The application is intentionally split into authoritative mutation and read‑only validation. Component A — DNS Reconciler (authoritative writer) A thin Python v2 model wrapper: receives the Event Grid event validates this is an ACI container group event parses identifiers from the ARM subject resolves DNS configuration from a JSON mapping (environment variable) delegates DNS mutation to a deterministic worker script DNS changes are not implemented inline in Python. Instead, the function: constructs a controlled set of environment variables invokes a worker script (/bin/bash) via subprocess streams stdout/stderr into function logs treats non‑zero exit codes as hard failures This thin wrapper + deterministic worker pattern isolates DNS correctness logic while keeping the event handler stable and testable. Component B — IP Drift Tracker (stateless observer) The drift tracker is a read‑only, stateless validator designed for correctness monitoring. It: parses identifiers from the event subject exits early on delete events (nothing to validate) reads the live ACI private IP using the Azure SDK reads the current DNS A record baseline compares live vs DNS state and emits drift telemetry Core comparison logic No DNS record exists → emit first_seen DNS record matches live IP → emit no_change DNS record differs from live IP → emit drift_detected (old/new IP) Optionally, drift events can be shipped to Log Analytics using DCR‑based ingestion. 4. DNS Reconciler: execution flow Step 1 — Early filtering Reject any event whose subject does not contain: Microsoft.ContainerInstance/containerGroups. This avoids unnecessary processing and ensures strict contract enforcement. Step 2 — ARM subject parsing The function splits the subject path and extracts: resource group container group name This approach is fast, robust, and avoids publisher‑specific schema dependencies. Step 3 — Zone configuration resolution DNS configuration is resolved from a JSON map stored in an environment variable. If no matching configuration exists for the resource group: the function logs the condition exits without error Why this matters This keeps the solution multi‑environment without duplicating deployments. Only configuration changes — not code — are required. Step 4 — Delegation to worker logic The function constructs a deterministic runtime context and invokes the worker: forward zone name reverse zone name(s) container group name current private IP TTL and execution flags The worker performs reconciliation and exits with explicit success or failure. 5. What “reconciliation” actually means Reconciliation follows clear, idempotent semantics. Create / Update events Upsert A record if record exists and matches current IP → no‑op else → create or overwrite with new IP Upsert PTR record compute PTR name using IP octets and reverse zone alignment create or overwrite PTR to hostname.<forward-zone> Delete events delete the A record for the hostname scan PTR record sets: remove targets matching the hostname delete record set if empty All operations are safe to repeat. 6. Why IP drift tracking is separate DNS reconciliation enforces correctness at event time, but drift can still occur due to: manual DNS edits partial failures delete / recreate race conditions unexpected redeployments or restarts The drift tracker exists as a continuous correctness validator, not as a repair mechanism. This separation keeps responsibilities clear: Reconciler → fixes state Drift tracker → observes and reports state 7. Observability: correctness vs runtime health There is an important distinction: Runtime health container crashes image pull failures restarts platform events (visible in standard ACI / Container logs) DNS correctness A record != live IP missing PTR records stale reverse mappings The IP Drift Tracker provides this correctness layer, which complements — not replaces — runtime monitoring. 8. Engineering constraints that shape the design At‑least‑once delivery → idempotency Event Grid delivery must be treated as at‑least‑once. Every reconciliation action is safe to execute multiple times. Explicit failure behavior If the worker script returns a non‑zero exit code: the function invocation fails the failure is visible and alertable incorrect DNS does not silently persist202Views2likes0CommentsAdmin‑On‑Behalf‑Of issue when purchasing subscription
Hello everyone! I want to reach out to you on the internet and ask if anyone has the same issue as we do when creating PAYG Azure subscriptions in a customer's tenant, in which we have delegated access via GDAP through PartnerCenter. It is a bit AI formatted question. When an Azure NCE subscription is created for a customer via an Indirect Provider portal, the CSP Admin Agent (foreign principal) is not automatically assigned Owner on the subscription. As a result: AOBO (Admin‑On‑Behalf‑Of) does not activate The subscription is invisible to the partner when accessing Azure via Partner Center service links The partner cannot manage and deploy to a subscription they just provided This breaks the expected delegated administration flow. Expected Behavior For CSP‑created Azure subscriptions: The CSP Admin Agent group should automatically receive Owner (or equivalent) on the subscription AOBO should work immediately, without customer involvement The partner should be able to see the subscription in Azure Portal and deploy resources Actual Behavior Observed For Azure NCE subscriptions created via an Indirect Provider: No RBAC assignment is created for the foreign AdminAgent group The subscription is visible only to users inside the customer tenant Partner Center role (Admin Agent foreign group) is present, but without Azure RBAC. Required Customer Workaround For each new Azure NCE subscription, the customer must: Sign in as Global Admin Use “Elevate access to manage all Azure subscriptions and management groups” Assign themselves Owner on the subscription Manually assign Owner to the partner’s foreign AdminAgent group Only after this does AOBO start working. Example Partner tries to access the subscription: https://portal.azure.com/#@customer.onmicrosoft.com/resource/subscriptions/<subscription-id>/overview But there is no subscription visible "None of the entries matched the given filter" https://learn.microsoft.com/en-us/azure/role-based-access-control/elevate-access-global-admin?tabs=azure-portal%2Centra-audit-logs#step-1-elevate-access-for-a-global-administrator from the customer's global admin. and manual RBAC fix in Cloud console: az role assignment create \ --assignee-object-id "<AdminAgent-Foreign-Group-ObjectId>" \ --role "Owner" \ --scope "/subscriptions/<subscription-id>" \ --assignee-principal-type "ForeignGroup" After this, AOBO works as expected for delegated administrators (foreign user accounts). Why This Is a Problem Partners sell Azure subscriptions that they cannot access Forces resources from customers to involvement from customers Breaks delegated administration principles For Indirect CSPs managing many tenants, this is a decent operational blocker. Key Question to Microsoft / Community Does anyone else struggle with this? Is this behavior by design for Azure NCE + Indirect CSP? Am I missing some point of view on why not to do it in the suggested way?79Views0likes0CommentsUpskilling for Technical Architect Design in Azure
I have a strong technical background and am interested in gaining hands-on experience with technical architect design. Could you please advise on the best ways to acquire these skills? I researched alternatives to ERP (Enterprise resource planning) on Google and found the list below of Azure components. Could you please also help to confirm whether I selected the appropriate data from Azure components? Enterprise resource planning Azure Service Finance Microsoft Cost Management Microsoft Dynamics 365 Finance Microsoft Power BI Azure Collaboration Services (Teams, sharepoint to improve collaboration and communication Purchasing Microsoft Dynamics 365 Business Central Power Automate Azure logic Apps custom development Manufacturing Teams for colloboration Dynamics 365 for Field Service Power BI Inventory Mgmt Azure Integration with Existing Systems: Azure SQL Database, Azure Functions, and Azure App Service CRM & Sales Dynamics 365 Sales HR Microsoft Dynamics 365 HR Azure Marketplace HR Apps Build your own HR solution - Azure SQL database, Azure logic and Power BI eCommerce Teams, Azure Boards, MS project and Power BI Service Project Mgmt Reporting Azure DevOPS I've also developed a basic architectural design for maintaining a centralized database for all key components. Would you be willing to review it for any potential improvements? I'm eager to learn and refine my design based on your feedback.524Views0likes1CommentVideo Recording: Azure Architecture Best Practices
The video recording from the free online event where I was presenting together with Microsoft Cloud Solution Architect, Dominik Zemp, about Azure Architecture Best Practices is now available. In this session, you will learn about proven guidance that’s designed to help you, architect, create and implement the business and technology strategies necessary for your organization to succeed in the cloud. It provides best practices, documentation, and tools that cloud architects, IT professionals, and business decision-makers need to successfully achieve their short- and long-term objectives. We will be focusing on topics like the Cloud Adoption Framework and the new Enterprise-Scale landing zone architecture. Azure Architecture Best Practices Virtual Event Agenda: Introduction Why Azure Architecture? Introduction to the Cloud Adoption Framework What is Enterprise-Scale? Build landing zones with Enterprise-Scale Critical design areas Deployment using AzOps Demo Build on top of Enterprise-Scale – Well-Architected Framework for workloads and apps Q&A
3KViews8likes2Comments