Recent Discussions
RC4 Deprecating by April
I’m reviewing our Seamless SSO setup and noticed that the AzureADSSOAcc account is still using RC4 (encryption type 0x17) from Kerberos event logs. I have a few questions regarding this: Why does AzureADSSOAcc still default to RC4 instead of AES, even when the domain supports AES? With Microsoft disabling RC4 (April updates), will AzureADSSOAcc automatically switch to AES? If it does not switch automatically, what is the recommended way to force it to use AES? Is running Update-AzureADSSOForest (key rotation) sufficient, and does it cause any downtime or impact to Seamless SSO? I want to make sure we transition to AES safely without breaking SSO for users. Any guidance or real-world experience would be appreciated.7Views0likes0CommentsThe March 2026 Innovation Challenge Winners
For this round of the Innovation Challenge the organizations we sponsor helped over 15,000 developers get the skills it takes to build AI solutions on Azure. This program in grounded in Microsoft’s mission and designed to enable a diverse and qualified community of professional developers coming together to tackle big problems. We helped almost 1,000 people earn Microsoft certifications and Applied Skills credentials, and 300 participated in the invitation only March 2026 Innovation Challenge hackathon. Teams represented SHPE, Women in Cloud, Código Facilito, DIO, GenSpark, NASA Space Apps, Project Blue Mountain, and TechBridge. Check out the winning project to meet some of the best AI talent in our community and to get inspired about what we can build together! First place $10,000 Pebble. - AI Cognitive Load Companion Pebble. is named after a worry stone: something small and smooth you reach for when the world feels like too much. It's an AI cognitive support companion that turns overwhelming documents, tasks, and information into calm, structured clarity. Built for neurodivergent minds. Useful for everyone. Second place $5,000 The Living Memory Bridge We believe dementia represents the most extreme form of cognitive overload that exists. It is not just information overload. It is cognitive loss: the gradual erosion of the very tools people use to process the world. Every principle in the brief applies here in its most urgent form: simplified language, adaptive communication, calm and dignity-preserving interactions, personalized memory anchors, and support that meets people exactly where they are. Query to Insight Analytics CRAM CRAM is a natural language healthcare analytics platform built entirely on Azure that lets clinical and administrative staff query a patient database using plain English, no SQL required. Users type a question like "What are the top 10 conditions among diabetic patients?" and get back a written summary, a data table, and an auto-generated chart in seconds. Third place $2,500 ClearStep ClearStep is an action-first AI system designed to reduce decision overload in high-risk or confusing situations. Instead of only detecting risk, it tells users exactly what to do next. The core innovation is architectural: model output is not trusted. Every response is enforced by a validation layer that guarantees structure, corrects model errors, and prevents unsafe or misleading outputs from reaching the user. DataTalk Our platform enables seamless data ingestion from Excel, CSV, SharePoint, and OneDrive, processes it through a two-layer analytical pipeline powered by DuckDB, and orchestrates four specialized AI agents that work as a team: understanding intent, reading data structure, generating and self-correcting SQL, and enforcing security and auditability at every step RAGulator AI Governance Engine Advanced, governed, and traceable RAG (Retrieval-Augmented Generation) system for international trade. RAGulator is a 100% functional solution that unifies the Azure intelligence ecosystem to deliver grounded responses with immutable bibliographic citations.244Views0likes0CommentsGraphic issue on single session host personal avd
We recently deployed single session host with azure gallery image(windows1125H2enterprise+m365apps) and random users are facing graphic issue on the avd,screen fully get blue line unable to see anything on the display,how to resolve this?9Views0likes0CommentsDetecting ACI IP Drift and Auto-Updating Private DNS (A + PTR) with Event Grid + Azure Functions
Solution Author Aditya_AzureNinja , Chiragsharma30 Solution Version v1.0 TL;DR Azure Container Instances (ACI) container groups can be recreated/updated over time and may receive new private IPs, which can cause DNS mismatches if forward and reverse records aren’t updated. This post shares an event-driven pattern that detects ACI IP drift and automatically reconciles Private DNS A (forward) and PTR (reverse) records using Event Grid + Azure Functions. Key requirement: Event delivery is at-least-once, so the solution must be idempotent. Problem statement In hub-and-spoke environments using per-spoke Private DNS zones for isolation, ACI workloads created/updated/deleted over time can receive new private IPs. We need to ensure: Forward lookup: aci-name.<spoke-zone> (A record) → current ACI private IP Reverse lookup: IP → aci-name.<spoke-zone> (PTR record) Two constraints drive this design: Azure Private DNS auto-registration is VM-only and does not create PTR records, so ACI needs explicit A/PTR record management. Reverse DNS is scoped to the VNet (reverse zone must be linked to the querying VNet, otherwise reverse lookup returns NXDOMAIN). Design principle: This solution was designed with the following non‑negotiable engineering goals: Event‑driven DNS updates must be triggered directly from resource lifecycle events, not polling or scheduled jobs. Container creation, restart, and deletion are the only reliable sources of truth for IP changes in ACI. Idempotent Azure Event Grid delivers events with at‑least‑once semantics. The system must safely process duplicate events without creating conflicting DNS records or failing on retries. Stateless The automation must not rely on in‑memory or persisted state to determine correctness. DNS itself is treated as the baseline state, allowing functions to scale, restart, and replay events without drift or dependency on prior executions. Clear failure modes DNS reconciliation failures must be explicit and observable. If DNS updates fail, the function invocation must fail loudly so the issue is visible, alertable, and actionable—never silently ignored. Components Event Grid subscriptions (filtered to ACI container group lifecycle events) Azure Function App (Python) with System Assigned Managed Identity Private DNS forward zone (A records) Private DNS reverse zone (PTR records) Supporting infra (typical): Storage account (function artifacts / operational needs) Application Insights + Log Analytics (observability) Event-driven flow ACI container group is created/updated/deleted. Event Grid emits a lifecycle event (delivery can be repeated). Function is triggered and reads the current ACI private IP. Function reconciles DNS: Upsert A record to current IP Upsert PTR record to FQDN Remove stale PTR(s) for hostname/IP as needed Function logs reconciliation outcome (updated vs no-op). Architecture overview (INFRA) This follows the“Event-driven registration” approach: Event Grid → Azure Function that reconciles DNS on ACI lifecycle events. RBAC at a glance (Managed Identity) Role Scope Purpose Storage Blob Data Owner Function App deployment storage account Access function artifacts and operational blobs (required because shared key access is disabled). Reader Each ACI workload resource group Read container group state and determine the current private IP. Private DNS Zone Contributor Private DNS forward zone(s) Create, update, and delete A records for ACI hostnames. Private DNS Zone Contributor Private DNS reverse zone(s) Create, update, and clean up PTR records for ACI IPs. Monitoring Metrics Publisher (optional) Data Collection Rule (DCR) Upload structured IP‑drift events to Log Analytics via the ingestion API. --- --- Architecture overview (APP) Event‑Driven DNS Reconciliation for Azure Container Instances 1. Event contract: what the function receives Azure Event Grid delivers events using a consistent envelope (Event Grid schema). Each event includes, at a minimum: topic subject id eventType eventTime data dataVersion metadataVersion In Azure Functions, the Event Grid trigger binding is the recommended way to receive these events directly. Why the subject field matters The subject field typically contains the ARM resource ID path of the affected resource. This solution relies on subject to: verify that the event is for an ACI container group (Microsoft.ContainerInstance/containerGroups) extract: subscription ID resource group name container group name Using subject avoids dependence on publisher‑specific payload fields and keeps parsing fast, deterministic, and resilient. 2. Subscription design: filter hard, process little The solution follows a strict runbook pattern: subscribe only to ARM lifecycle events filter aggressively so only ACI container groups are included trigger reconciliation only on meaningful state transitions Recommended Event Grid event types Microsoft.Resources.ResourceWriteSuccess (create / update / stop state changes) Microsoft.Resources.ResourceDeleteSuccess (container group deletion) Microsoft.Resources.ResourceActionSuccess (optional) (restart / start / stop actions, environment‑dependent) This keeps the Function App simple, predictable, and low‑noise. 3. Application design: two functions, one contract The application is intentionally split into authoritative mutation and read‑only validation. Component A — DNS Reconciler (authoritative writer) A thin Python v2 model wrapper: receives the Event Grid event validates this is an ACI container group event parses identifiers from the ARM subject resolves DNS configuration from a JSON mapping (environment variable) delegates DNS mutation to a deterministic worker script DNS changes are not implemented inline in Python. Instead, the function: constructs a controlled set of environment variables invokes a worker script (/bin/bash) via subprocess streams stdout/stderr into function logs treats non‑zero exit codes as hard failures This thin wrapper + deterministic worker pattern isolates DNS correctness logic while keeping the event handler stable and testable. Component B — IP Drift Tracker (stateless observer) The drift tracker is a read‑only, stateless validator designed for correctness monitoring. It: parses identifiers from the event subject exits early on delete events (nothing to validate) reads the live ACI private IP using the Azure SDK reads the current DNS A record baseline compares live vs DNS state and emits drift telemetry Core comparison logic No DNS record exists → emit first_seen DNS record matches live IP → emit no_change DNS record differs from live IP → emit drift_detected (old/new IP) Optionally, drift events can be shipped to Log Analytics using DCR‑based ingestion. 4. DNS Reconciler: execution flow Step 1 — Early filtering Reject any event whose subject does not contain: Microsoft.ContainerInstance/containerGroups. This avoids unnecessary processing and ensures strict contract enforcement. Step 2 — ARM subject parsing The function splits the subject path and extracts: resource group container group name This approach is fast, robust, and avoids publisher‑specific schema dependencies. Step 3 — Zone configuration resolution DNS configuration is resolved from a JSON map stored in an environment variable. If no matching configuration exists for the resource group: the function logs the condition exits without error Why this matters This keeps the solution multi‑environment without duplicating deployments. Only configuration changes — not code — are required. Step 4 — Delegation to worker logic The function constructs a deterministic runtime context and invokes the worker: forward zone name reverse zone name(s) container group name current private IP TTL and execution flags The worker performs reconciliation and exits with explicit success or failure. 5. What “reconciliation” actually means Reconciliation follows clear, idempotent semantics. Create / Update events Upsert A record if record exists and matches current IP → no‑op else → create or overwrite with new IP Upsert PTR record compute PTR name using IP octets and reverse zone alignment create or overwrite PTR to hostname.<forward-zone> Delete events delete the A record for the hostname scan PTR record sets: remove targets matching the hostname delete record set if empty All operations are safe to repeat. 6. Why IP drift tracking is separate DNS reconciliation enforces correctness at event time, but drift can still occur due to: manual DNS edits partial failures delete / recreate race conditions unexpected redeployments or restarts The drift tracker exists as a continuous correctness validator, not as a repair mechanism. This separation keeps responsibilities clear: Reconciler → fixes state Drift tracker → observes and reports state 7. Observability: correctness vs runtime health There is an important distinction: Runtime health container crashes image pull failures restarts platform events (visible in standard ACI / Container logs) DNS correctness A record != live IP missing PTR records stale reverse mappings The IP Drift Tracker provides this correctness layer, which complements — not replaces — runtime monitoring. 8. Engineering constraints that shape the design At‑least‑once delivery → idempotency Event Grid delivery must be treated as at‑least‑once. Every reconciliation action is safe to execute multiple times. Explicit failure behavior If the worker script returns a non‑zero exit code: the function invocation fails the failure is visible and alertable incorrect DNS does not silently persistBuilding Multi-Agent Orchestration Using Microsoft Semantic Kernel: A Complete Step-by-Step Guide
What You Will Build By the end of this guide, you will have a working multi-agent system where 4 specialist AI agents collaborate to diagnose production issues: ClientAnalyst — Analyzes browser, JavaScript, CORS, uploads, and UI symptoms NetworkAnalyst — Analyzes DNS, TCP/IP, TLS, load balancers, and firewalls ServerAnalyst — Analyzes backend logs, database, deployments, and resource limits Coordinator — Synthesizes all findings into a root cause report with a prioritized action plan These agents don't just run in sequence — they debate, cross-examine, and challenge each other's findings through a shared conversation, producing a diagnosis that's better than any single agent could achieve alone. Table of Contents Why Multi-Agent? The Problem with Single Agents Architecture Overview Understanding the Key SK Components The Actor Model — How InProcessRuntime Works Setting Up Your Development Environment Step-by-Step: Building the Multi-Agent Analyzer The Agent Interaction Flow — Round by Round Bugs I Found & Fixed — Lessons Learned Running with Different AI Providers What to Build Next 1. Why Multi-Agent? The Problem with Single Agents A single AI agent analyzing a production issue is like having one doctor diagnose everything — they'll catch issues in their specialty but miss cross-domain connections. Consider this problem: "Users report 504 Gateway Timeout errors when uploading files larger than 10MB. Started after Friday's deployment. Worse during peak hours." A single agent might say "it's a server timeout" and stop. But the real root cause often spans multiple layers: The client is sending chunked uploads with an incorrect Content-Length header (client-side bug) The load balancer has a 30-second timeout that's too short for large uploads (network config) The server recently deployed a new request body parser that's 3x slower (server-side regression) The combination only fails during peak hours because connection pool saturation amplifies the latency No single perspective catches this. You need specialists who analyze independently, then debate to find the cross-layer causal chain. That's what multi-agent orchestration gives you. The 5 Orchestration Patterns in SK Semantic Kernel provides 5 built-in patterns for agent collaboration: SEQUENTIAL: A → B → C → Done (pipeline — each builds on previous) CONCURRENT: ↗ A ↘ Task → B → Aggregate ↘ C ↗ (parallel — results merged) GROUP CHAT: A ↔ B ↔ C ↔ D ← We use this one (rounds, shared history, debate) HANDOFF: A → (stuck?) → B → (complex?) → Human (escalation with human-in-the-loop) MAGENTIC: LLM picks who speaks next dynamically (AI-driven speaker selection) We use GroupChatOrchestration with RoundRobinGroupChatManager because our problem requires agents to see each other's work, challenge assumptions, and build on each other's analysis across two rounds. 2. Architecture Overview Here's the complete architecture of what we're building: 3. Understanding the Key SK Components Before we write code, let's understand the 5 components we'll use and the design pattern each implements: ChatCompletionAgent — Strategy Pattern The agent definition. Each agent is a combination of: name — unique identifier (used in round-robin ordering) instructions — the persona and rules (this is the prompt engineering) service — which AI provider to call (Strategy Pattern — swap providers without changing agent logic) description — what other agents/tools understand about this agent agent = ChatCompletionAgent( name="ClientAnalyst", instructions="You are ONLY ClientAnalyst...", service=gemini_service, # ← Strategy: swap to OpenAI with zero changes description="Analyzes client-side issues", ) GroupChatOrchestration — Mediator Pattern The orchestration defines HOW agents interact. It's the Mediator — agents don't talk to each other directly. Instead, the orchestration manages a shared ChatHistory and routes messages through the Manager. RoundRobinGroupChatManager — Strategy Pattern The Manager decides WHO speaks next. RoundRobinGroupChatManager cycles through agents in a fixed order. SK also provides AutomaticGroupChatManager where the LLM decides who speaks next. max_rounds is the total number of messages per agent or cycle. With 4 agents and max_rounds=8, each agent speaks exactly twice. InProcessRuntime — Actor Model Abstraction The execution engine. Every agent becomes an "actor" with its own kind of mailbox (message queue). The runtime delivers messages between actors. Key properties: No shared state — agents communicate only through messages Sequential processing — each agent processes one message at a time Location transparency — same code works in-process today, distributed tomorrow agent_response_callback — Observer Pattern A function that fires after EVERY agent response. We use it to display each agent's output in real-time with emoji labels and round numbers. 4. The Actor Model — How InProcessRuntime Works The Actor Model is a concurrency pattern where each entity is an isolated "actor" with a private mailbox. Here's what happens inside InProcessRuntime when we run our demo: runtime.start() │ ├── Creates internal message loop (asyncio event loop) │ orchestration.invoke(task="504 timeout...", runtime=runtime) │ ├── Creates Actor[Orchestrator] → manages overall flow ├── Creates Actor[Manager] → RoundRobinGroupChatManager ├── Creates Actor[ClientAnalyst] → mailbox created, waiting ├── Creates Actor[NetworkAnalyst] → mailbox created, waiting ├── Creates Actor[ServerAnalyst] → mailbox created, waiting └── Creates Actor[Coordinator] → mailbox created, waiting Manager receives "start" message │ ├── Checks turn order: [Client, Network, Server, Coordinator] ├── Sends task to ClientAnalyst mailbox │ → ClientAnalyst processes: calls LLM → response │ → Response added to shared ChatHistory │ → callback fires (displayed in Notebook UI) │ → Sends "done" back to Manager │ ├── Manager updates: turn_index=1 ├── Sends to NetworkAnalyst mailbox │ → Same flow... │ ├── ... (ServerAnalyst, Coordinator for Round 1) │ ├── Manager checks: messages=4, max_rounds=8 → continue │ ├── Round 2: same cycle with cross-examination │ └── After message 8: Manager sends "complete" → OrchestrationResult resolves → result.get() returns final answer runtime.stop_when_idle() → All mailboxes empty → clean shutdown The Actor Model guarantees: No race conditions (each actor processes one message at a time) No deadlocks (no shared locks to contend for) No shared mutable state (agents communicate only via messages) 5. Setting Up Your Development Environment Prerequisites Python 3.11 or 3.12 (3.13+ may have compatibility issues with some SK connectors) Visual Studio Code with the Python and Jupyter extensions An API key from one of: Google AI Studio (free), OpenAI Step 1: Install Python Download from python.org. During installation, check "Add Python to PATH". Verify: python --version # Python 3.12.x Step 2: Install VS Code Extensions Open VS Code, go to Extensions (Ctrl+Shift+X), and install: Python (by Microsoft) — Python language support Jupyter (by Microsoft) — Notebook support Pylance (by Microsoft) — IntelliSense and type checking Step 3: Create Project Folder mkdir sk-multiagent-demo cd sk-multiagent-demo Open in VS Code: code . Step 4: Create Virtual Environment Open the VS Code terminal (Ctrl+`) and run: # Create virtual environment python -m venv sk-env # Activate it # Windows: sk-env\Scripts\activate # macOS/Linux: source sk-env/bin/activate You should see (sk-env) in your terminal prompt. Step 5: Install Semantic Kernel For Google Gemini (free tier — recommended for getting started): pip install semantic-kernel[google] python-dotenv ipykernel For OpenAI (paid API key): pip install semantic-kernel openai python-dotenv ipykernel For Azure AI Foundry (enterprise, Entra ID auth): pip install semantic-kernel azure-identity python-dotenv ipykernel Step 6: Register the Jupyter Kernel python -m ipykernel install --user --name=sk-env --display-name="Semantic Kernel (Python 3.12)" You can also select if this is already available from your environment from VSCode as below: Step 7: Get Your API Key Option A — Google Gemini (FREE, recommended for demo): Go to https://aistudio.google.com/apikey Click "Create API Key" Copy the key Free tier limits: 15 requests/minute, 1 million tokens/minute — more than enough for this demo. Option B — OpenAI: Go to https://platform.openai.com/api-keys Create a new key Copy the key Option C — Azure AI Foundry: Deploy a model in Azure AI Foundry portal Note the endpoint URL and deployment name If key-based auth is disabled, you'll need Entra ID with permissions Step 8: Create the .env File In your project root, create a file named .env: For Gemini: GOOGLE_AI_API_KEY=AIzaSy...your-key-here GOOGLE_AI_GEMINI_MODEL_ID=gemini-2.5-flash For OpenAI: OPENAI_API_KEY=sk-...your-key-here OPENAI_CHAT_MODEL_ID=gpt-4o For Azure AI Foundry: AZURE_OPENAI_ENDPOINT=https://your-resource.cognitiveservices.azure.com AZURE_OPENAI_CHAT_DEPLOYMENT_NAME=gpt-4o AZURE_OPENAI_API_KEY=your-key Step 9: Create the Notebook In VS Code: Click File > New File Save as multi_agent_analyzer.ipynb In the top-right of the notebook, click Select Kernel Choose Semantic Kernel (Python 3.12) (or your sk-env) Your environment is ready. Let's build. 6. Step-by-Step: Building the Multi-Agent Analyzer Cell 1: Verify Setup import semantic_kernel print(f"Semantic Kernel version: {semantic_kernel.__version__}") from semantic_kernel.agents import ( ChatCompletionAgent, GroupChatOrchestration, RoundRobinGroupChatManager, ) from semantic_kernel.agents.runtime import InProcessRuntime from semantic_kernel.contents import ChatMessageContent print("All imports successful") Cell 2: Load API Key and Create Service For Gemini: import os from dotenv import load_dotenv load_dotenv() from semantic_kernel.connectors.ai.google.google_ai import ( GoogleAIChatCompletion, GoogleAIChatPromptExecutionSettings, ) from semantic_kernel.contents import ChatHistory GEMINI_API_KEY = os.getenv("GOOGLE_AI_API_KEY") GEMINI_MODEL = os.getenv("GOOGLE_AI_GEMINI_MODEL_ID", "gemini-2.5-flash") service = GoogleAIChatCompletion( gemini_model_id=GEMINI_MODEL, api_key=GEMINI_API_KEY, ) print(f"Service created: Gemini {GEMINI_MODEL}") # Smoke test settings = GoogleAIChatPromptExecutionSettings() test_history = ChatHistory(system_message="You are a helpful assistant.") test_history.add_user_message("Say 'Connected!' and nothing else.") response = await service.get_chat_message_content( chat_history=test_history, settings=settings ) print(f"Model says: {response.content}") For OpenAI: import os from dotenv import load_dotenv load_dotenv() from semantic_kernel.connectors.ai.open_ai import ( OpenAIChatCompletion, OpenAIChatPromptExecutionSettings, ) from semantic_kernel.contents import ChatHistory service = OpenAIChatCompletion( ai_model_id=os.getenv("OPENAI_CHAT_MODEL_ID", "gpt-4o"), ) print(f"Service created: OpenAI {os.getenv('OPENAI_CHAT_MODEL_ID', 'gpt-4o')}") # Smoke test settings = OpenAIChatPromptExecutionSettings() test_history = ChatHistory(system_message="You are a helpful assistant.") test_history.add_user_message("Say 'Connected!' and nothing else.") response = await service.get_chat_message_content( chat_history=test_history, settings=settings ) print(f"Model says: {response.content}") Cell 3: Define All 4 Agents This is the most important cell — the prompt engineering that makes the demo work: from semantic_kernel.agents import ChatCompletionAgent # ═══════════════════════════════════════════════════ # AGENT 1: Client-Side Analyst # ═══════════════════════════════════════════════════ client_agent = ChatCompletionAgent( name="ClientAnalyst", description="Analyzes problems from the client-side: browser, JS, CORS, caching, UI symptoms", instructions="""You are ONLY **ClientAnalyst**. You must NEVER speak as NetworkAnalyst, ServerAnalyst, or Coordinator. Every word you write is from ClientAnalyst's perspective only. You are a senior front-end and client-side diagnostics expert. When given a problem statement, analyze it EXCLUSIVELY from the client side: 1. **Browser & Rendering**: DOM issues, JavaScript errors, CSS rendering, browser compatibility, memory leaks, console errors. 2. **Client-Side Caching**: Stale cache, service worker issues, local storage corruption. 3. **Network from Client View**: CORS errors, preflight failures, request timeouts, client-side retry storms, fetch/XHR configuration. 4. **Upload Handling**: File API usage, chunk upload implementation, progress tracking, FormData construction, content-type headers. 5. **UI/UX Symptoms**: What the user sees, error messages displayed, loading states. ROUND 1: Provide your independent analysis. Do NOT reference other agents. List your top 3 most likely causes with evidence. Every response MUST be at least 200 words. ROUND 2: You MUST: - Reference NetworkAnalyst and ServerAnalyst BY NAME - State specifically where you AGREE or DISAGREE with their findings - Answer the Coordinator's questions from your perspective - Add NEW cross-layer insights you see from the client perspective - Do NOT just say 'I agree' — provide substantive technical reasoning Be specific, evidence-based, and prioritize findings by likelihood.""", service=service, ) # ═══════════════════════════════════════════════════ # AGENT 2: Network Analyst # ═══════════════════════════════════════════════════ network_agent = ChatCompletionAgent( name="NetworkAnalyst", description="Analyzes problems from the network side: DNS, TCP, TLS, firewalls, load balancers, latency", instructions="""You are ONLY **NetworkAnalyst**. You must NEVER speak as ClientAnalyst, ServerAnalyst, or Coordinator. Every word you write is from NetworkAnalyst's perspective only. You are a senior network infrastructure diagnostics expert. When given a problem statement, analyze it EXCLUSIVELY from the network layer: 1. **DNS & Resolution**: DNS TTL, propagation delays, record misconfigurations. 2. **TCP/IP & Connections**: Connection pooling, keep-alive, TCP window scaling, connection resets, SYN floods. 3. **TLS/SSL**: Certificate issues, handshake failures, protocol version mismatches. 4. **Load Balancers & Proxies**: Sticky sessions, health checks, timeout configs, request body size limits, proxy buffering. 5. **Firewall & WAF**: Rule blocks, rate limiting, request inspection delays, geo-blocking, DDoS protection interference. ROUND 1: Provide your independent analysis. Do NOT reference other agents. List your top 3 most likely causes with evidence. Every response MUST be at least 200 words. ROUND 2: You MUST: - Reference ClientAnalyst and ServerAnalyst BY NAME - State specifically where you AGREE or DISAGREE with their findings - Answer the Coordinator's questions from your perspective - Add NEW cross-layer insights you see from the network perspective - Do NOT just say 'I am ready to proceed' — provide substantive technical analysis Be specific, evidence-based, and prioritize findings by likelihood.""", service=service, ) # ═══════════════════════════════════════════════════ # AGENT 3: Server-Side Analyst # ═══════════════════════════════════════════════════ server_agent = ChatCompletionAgent( name="ServerAnalyst", description="Analyzes problems from the server side: backend app, database, logs, resources, deployments", instructions="""You are ONLY **ServerAnalyst**. You must NEVER speak as ClientAnalyst, NetworkAnalyst, or Coordinator. Every word you write is from ServerAnalyst's perspective only. You are a senior backend and infrastructure diagnostics expert. When given a problem statement, analyze it EXCLUSIVELY from the server side: 1. **Application Server**: Error logs, exception traces, thread pool exhaustion, memory leaks, CPU spikes, garbage collection pauses. 2. **Database**: Slow queries, connection pool saturation, lock contention, deadlocks, replication lag, query plan changes. 3. **Deployment & Config**: Recent deployments, configuration changes, feature flags, environment variable mismatches, rollback candidates. 4. **Resource Limits**: File upload size limits, request body limits, disk space, temporary file cleanup, storage quotas. 5. **External Dependencies**: Upstream API timeouts, third-party service degradation, queue backlogs, cache (Redis/Memcached) issues. ROUND 1: Provide your independent analysis. Do NOT reference other agents. List your top 3 most likely causes with evidence. Every response MUST be at least 200 words. ROUND 2: You MUST: - Reference ClientAnalyst and NetworkAnalyst BY NAME - State specifically where you AGREE or DISAGREE with their findings - Answer the Coordinator's questions from your perspective - Add NEW cross-layer insights you see from the server perspective - Do NOT just say 'I agree' — provide substantive technical reasoning Be specific, evidence-based, and prioritize findings by likelihood.""", service=service, ) # ═══════════════════════════════════════════════════ # AGENT 4: Coordinator # ═══════════════════════════════════════════════════ coordinator_agent = ChatCompletionAgent( name="Coordinator", description="Synthesizes all specialist analyses into a final root cause report with prioritized action plan", instructions="""You are ONLY **Coordinator**. You must NEVER speak as ClientAnalyst, NetworkAnalyst, or ServerAnalyst. You synthesize — you do NOT do domain-specific analysis. You are the lead engineer who synthesizes the team's findings. ═══ ROUND 1 BEHAVIOR (your first turn, message 4) ═══ Keep this SHORT — maximum 300 words. - Note 2-3 KEY PATTERNS across the three analyses - Identify where specialists AGREE (high-confidence) - Identify where they CONTRADICT (needs resolution) - Ask 2-3 SPECIFIC QUESTIONS for Round 2 Round 1 MUST NOT: assign tasks, create action plans, write reports, or tell agents what to take lead on. Observation + questions ONLY. ═══ ROUND 2 BEHAVIOR (your final turn, message 8) ═══ Keep this FOCUSED — maximum 800 words. Produce a structured report: 1. **Root Cause** (1 paragraph): The #1 most likely cause with causal chain across layers. Reference specific findings from each specialist. 2. **Confidence** (short list): - HIGH: Areas where all 3 agreed - MEDIUM: Areas where 2 of 3 agreed - LOW: Disagreements needing investigation 3. **Action Plan** (numbered, max 6 items): For each: - What to do (specific) - Owner (Client/Network/Server team) - Time estimate 4. **Quick Wins vs Long-term** (2 short lists) Do NOT repeat what specialists already said verbatim. Synthesize, don't echo.""", service=service, ) # ═══════════════════════════════════════════════════ # All 4 agents — order = RoundRobin order # ═══════════════════════════════════════════════════ agents = [client_agent, network_agent, server_agent, coordinator_agent] print(f"{len(agents)} agents created:") for i, a in enumerate(agents, 1): print(f" {i}. {a.name}: {a.description[:60]}...") print(f"\nRoundRobin order: {' → '.join(a.name for a in agents)}") Cell 4: Run the Analysis from semantic_kernel.agents import GroupChatOrchestration, RoundRobinGroupChatManager from semantic_kernel.agents.runtime import InProcessRuntime from semantic_kernel.contents import ChatMessageContent from IPython.display import display, Markdown # ╔══════════════════════════════════════════════════════════╗ # ║ EDIT YOUR PROBLEM STATEMENT HERE ║ # ╚══════════════════════════════════════════════════════════╝ PROBLEM = """ Users are reporting intermittent 504 Gateway Timeout errors when trying to upload files larger than 10MB through our web application. The issue started after last Friday's deployment and seems worse during peak hours (2-5 PM EST). Some users also report that smaller file uploads work fine but the progress bar freezes at 85% for large files before timing out. """ # ════════════════════════════════════════════════════════════ agent_responses = [] def agent_response_callback(message: ChatMessageContent) -> None: name = message.name or "Unknown" content = message.content or "" agent_responses.append({"agent": name, "content": content}) emoji = { "ClientAnalyst": "🖥️", "NetworkAnalyst": "🌐", "ServerAnalyst": "⚙️", "Coordinator": "🎯" }.get(name, "🔹") round_num = (len(agent_responses) - 1) // len(agents) + 1 display(Markdown( f"---\n### {emoji} {name} (Message {len(agent_responses)}, Round {round_num})\n\n{content}" )) MAX_ROUNDS = 8 # 4 agents × 2 rounds = 8 messages exactly task = f"""## Problem Statement {PROBLEM.strip()} ## Discussion Rules You are in a GROUP DISCUSSION with 4 members. You can see ALL previous messages. There are exactly 2 rounds. ### ROUND 1 (Messages 1-4): Independent Analysis - ClientAnalyst, NetworkAnalyst, ServerAnalyst: Analyze from YOUR domain only. Give your top 3 most likely causes with evidence and reasoning. - Coordinator: Note patterns across the 3 analyses. Ask 2-3 specific questions. Do NOT assign tasks yet. ### ROUND 2 (Messages 5-8): Cross-Examination & Final Report - ClientAnalyst, NetworkAnalyst, ServerAnalyst: You MUST reference the OTHER specialists BY NAME. State where you agree, disagree, or have new insights. Answer the Coordinator's questions. Provide SUBSTANTIVE analysis. - Coordinator: Produce the FINAL structured report: root cause, confidence levels, prioritized action plan with owners and time estimates. IMPORTANT: Each agent speaks as THEMSELVES only. Never impersonate another agent.""" display(Markdown(f"## Problem Statement\n\n{PROBLEM.strip()}")) display(Markdown(f"---\n## Discussion Starting — {len(agents)} agents, {MAX_ROUNDS} rounds\n")) # Build and run orchestration = GroupChatOrchestration( members=agents, manager=RoundRobinGroupChatManager(max_rounds=MAX_ROUNDS), agent_response_callback=agent_response_callback, ) runtime = InProcessRuntime() runtime.start() result = await orchestration.invoke(task=task, runtime=runtime) final_result = await result.get(timeout=300) await runtime.stop_when_idle() display(Markdown(f"---\n## FINAL CONCLUSION\n\n{final_result}")) Cell 5: Statistics and Validation print("═" * 55) print(" ANALYSIS STATISTICS") print("═" * 55) emojis = {"ClientAnalyst": "🖥️", "NetworkAnalyst": "🌐", "ServerAnalyst": "⚙️", "Coordinator": "🎯"} agent_counts = {} agent_chars = {} for r in agent_responses: agent_counts[r["agent"]] = agent_counts.get(r["agent"], 0) + 1 agent_chars[r["agent"]] = agent_chars.get(r["agent"], 0) + len(r["content"]) for agent, count in agent_counts.items(): em = emojis.get(agent, "🔹") chars = agent_chars.get(agent, 0) avg = chars // count if count else 0 print(f" {em} {agent}: {count} msg(s), ~{chars:,} chars (avg {avg:,}/msg)") print(f"\n Total messages: {len(agent_responses)}") total_chars = sum(len(r['content']) for r in agent_responses) print(f" Total analysis: ~{total_chars:,} characters") # Validation print(f"\n Validation:") import re identity_issues = [] for r in agent_responses: other_agents = [a.name for a in agents if a.name != r["agent"]] for other in other_agents: pattern = rf'(?i)as {re.escape(other)}[,:]?\s+I\b' if re.search(pattern, r["content"][:300]): identity_issues.append(f"{r['agent']} impersonated {other}") if identity_issues: print(f" Identity confusion: {identity_issues}") else: print(f" No identity confusion detected") thin = [r for r in agent_responses if len(r["content"].strip()) < 100] if thin: for t in thin: print(f" Thin response from {t['agent']}") else: print(f" All responses are substantive") Cell 6: Save Report from datetime import datetime timestamp = datetime.now().strftime("%Y%m%d_%H%M%S") filename = f"analysis_report_{timestamp}.md" with open(filename, "w", encoding="utf-8") as f: f.write(f"# Problem Analysis Report\n\n") f.write(f"**Generated:** {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n") f.write(f"**Agents:** {', '.join(a.name for a in agents)}\n") f.write(f"**Rounds:** {MAX_ROUNDS}\n\n---\n\n") f.write(f"## Problem Statement\n\n{PROBLEM.strip()}\n\n---\n\n") for i, r in enumerate(agent_responses, 1): em = emojis.get(r['agent'], '🔹') round_num = (i - 1) // len(agents) + 1 f.write(f"### {em} {r['agent']} (Message {i}, Round {round_num})\n\n") f.write(f"{r['content']}\n\n---\n\n") f.write(f"## Final Conclusion\n\n{final_result}\n") print(f"Report saved to: {filename}") 7. The Agent Interaction Flow — Round by Round Here's what actually happens during the 8-message orchestration: Round 1: Independent Analysis (Messages 1-4) Msg Agent What They See What They Do 1 ClientAnalyst Problem statement only Analyzes from client perspective: upload chunking, progress bar freezing at 85%, CORS, content-type headers 2 NetworkAnalyst Problem + ClientAnalyst's analysis Gives INDEPENDENT analysis despite seeing msg 1: load balancer timeouts, proxy body size limits, TCP window scaling 3 ServerAnalyst Problem + msgs 1-2 Gives INDEPENDENT analysis: recent deployment regression, request body parser, thread pool exhaustion, disk space 4 Coordinator Problem + msgs 1-3 Observes patterns: "All three mention timeout configuration. ClientAnalyst and NetworkAnalyst both point to body size. Question: Was the deployment a backend-only change or did it include infra?" Round 2: Cross-Examination (Messages 5-8) Msg Agent What They Do 5 ClientAnalyst "I agree with NetworkAnalyst that the load balancer timeout is likely a factor — the 85% freeze point matches the 30-second LB timeout for a 10MB upload on our average upload speed. However, I disagree with ServerAnalyst about thread pool exhaustion because the UI shows a clean 504, not a connection reset." 6 NetworkAnalyst "ServerAnalyst's point about the recent deployment is critical — if the new request parser is 3x slower, that would push uploads past the LB timeout. I can confirm the LB has a 30s idle timeout. The fix is both: increase LB timeout AND optimize the parser." 7 ServerAnalyst "Responding to Coordinator's question: The deployment was backend-only — a new multipart parser using streaming instead of buffered reads. ClientAnalyst is correct that the 504 is from the LB, not the app. The app itself returns 200 after 45 seconds, but the LB kills the connection at 30." 8 Coordinator Produces final structured report with root cause: "The backend deployment introduced a slower multipart parser (45s vs 15s for 10MB). The load balancer's 30s timeout kills the connection at ~85% progress. Fix: immediate — increase LB timeout to 120s. Short-term — optimize parser. Long-term — implement chunked uploads with progress resumption." Notice: The Round 2 analysis is dramatically better than Round 1. Agents reference each other by name, build on each other's findings, and the Coordinator can synthesize a cross-layer causal chain that no single agent could have produced. I made a small adjustment to the issue with Azure Web Apps. Please find the details below from testing carried out using Google Gemini: 8. Bugs I Found & Fixed — Lessons Learned Building this demo taught me several important lessons about multi-agent systems: Bug 1: Agents Speaking Only Once Symptom: Only 4 messages instead of 8. Root cause: The agents list was missing the Coordinator. It was defined in a separate cell and wasn't included in the members list. Fix: All 4 agents must be in the same list passed to GroupChatOrchestration. Bug 2: NetworkAnalyst Says "I'm Ready to Proceed" Symptom: NetworkAnalyst's Round 2 response was just "I'm ready to proceed with the analysis" — no actual content. Root cause: The Coordinator's Round 1 message was assigning tasks ("NetworkAnalyst, please check the load balancer config"), and the agent was acknowledging the assignment instead of analyzing. Fix: Added explicit constraint to Coordinator: "Round 1 MUST NOT assign tasks — observation + questions ONLY." Bug 3: ServerAnalyst Says "As NetworkAnalyst, I..." Symptom: ServerAnalyst's response started with "As NetworkAnalyst, I believe..." Root cause: LLM identity bleeding. When agents share ChatHistory, the LLM sometimes loses track of which agent it's currently playing. This is especially common with Gemini. Fix: Identity anchoring at the very top of every agent's instructions: "You are ONLY ServerAnalyst. You must NEVER speak as ClientAnalyst, NetworkAnalyst, or Coordinator." Bug 4: Gemini Gives Thin/Empty Responses Symptom: Some agents responded with just one sentence or "I concur." Root cause: Gemini 2.5 Flash is more concise than GPT-4o by default. Without explicit length requirements, it takes shortcuts. Fix: Added "Every response MUST be at least 200 words" and "Answer the Coordinator's questions" to every specialist's instructions. Bug 5: Coordinator's Report is 18K Characters Symptom: The Coordinator's Round 2 response was absurdly long — repeating everything every specialist said. Fix: Added word limits: "Round 1 max 300 words, Round 2 max 800 words" and "Synthesize, don't echo." Bug 6: MAX_ROUNDS Math Symptom: With MAX_ROUNDS=9, ClientAnalyst spoke a 3rd time after the Coordinator's final report — breaking the clean 2-round structure. Fix: MAX_ROUNDS must equal (number of agents × number of rounds). For 4 agents × 2 rounds = 8. 9. Running with Different AI Providers The beauty of SK's Strategy Pattern is that you change ONE LINE to switch providers. Everything else — agents, orchestration, callbacks, validation — stays identical. Gemini setup: from semantic_kernel.connectors.ai.google.google_ai import GoogleAIChatCompletion service = GoogleAIChatCompletion( gemini_model_id="gemini-2.5-flash", api_key=os.getenv("GOOGLE_AI_API_KEY"), ) OpenAI Setup from semantic_kernel.connectors.ai.open_ai import OpenAIChatCompletion service = OpenAIChatCompletion( ai_model_id="gpt-4o", api_key=os.getenv("OPEN_AI_API_KEY"), ) 10. What to Build Next Add Plugins to Agents Give agents real tools — not just LLM reasoning - looks exciting right ;) class NetworkDiagnosticPlugin: (description="Pings a host and returns latency") def ping(self, host: str) -> str: result = subprocess.run(["ping", "-c", "3", host], capture_output=True, text=True) return result.stdout class LogSearchPlugin: (description="Searches server logs for error patterns") def search_logs(self, pattern: str, hours: int = 1) -> str: # Query your log aggregator (Splunk, ELK, Azure Monitor) return query_logs(pattern, hours) Add Filters for Governance Intercept every agent call for PII redaction and audit logging: .filter(filter_type=FilterTypes.FUNCTION_INVOCATION) async def audit_filter(context, next): print(f"[AUDIT] {context.function.name} called by agent") await next(context) print(f"[AUDIT] {context.function.name} returned") Try Different Orchestration Patterns Replace GroupChat with Sequential for a pipeline approach: # Instead of debate, each agent builds on the previous orchestration = SequentialOrchestration( members=[client_agent, network_agent, server_agent, coordinator_agent] ) Or Concurrent for parallel analysis: # All specialists analyze simultaneously, Coordinator aggregates orchestration = ConcurrentOrchestration( members=[client_agent, network_agent, server_agent] ) Deploy to Azure Move from InProcessRuntime to Azure Container Apps for production scaling. The agent code doesn't change — only the runtime. Summary The key insight from building this demo: multi-agent systems produce better results than single agents not because each agent is smarter, but because the debate structure forces cross-domain thinking that a single prompt can never achieve. The Coordinator's final report consistently identifies causal chains that span client, network, and server layers — exactly the kind of insight that production incident response teams need. Semantic Kernel makes this possible with clean separation of concerns: agents define WHAT to analyze, orchestration defines HOW they interact, the manager defines WHO speaks when, the runtime handles WHERE it executes, and callbacks let you OBSERVE everything. Each piece is independently swappable — that's the power of SK from Microsoft. Resources: GitHub: github.com/microsoft/semantic-kernel Docs: learn.microsoft.com/semantic-kernel Orchestration Patterns: learn.microsoft.com/semantic-kernel/frameworks/agent/agent-orchestration Discord: aka.ms/sk/discord Disclaimer: The sample scripts provided in this article are provided AS IS without warranty of any kind. The author is not responsible for any issues, damages, or problems that may arise from using these scripts. Users should thoroughly test any implementation in their environment before deploying to production. Azure services and APIs may change over time, which could affect the functionality of the provided scripts. Always refer to the latest Azure documentation for the most up-to-date information. Thanks for reading this blog! I hope you found it helpful and informative for building AI agents with SK (Semantic Kernel) 😀93Views3likes0CommentsDocker Engine v29 on Linux: Why data-root No Longer Prevents OS Disk Growth (and How to Fix It)
Scope Applies to Linux hosts only Does not apply to Windows or Docker Desktop Problem Summary After upgrading to Docker Engine v29 or reimaging Linux nodes with this version, you may observe unexpected growth on the OS disk, even when Docker is configured with a custom data-root pointing to a mounted data disk. This commonly affects cloud environments (VMSS, Azure Batch, self‑managed Linux VMs) where the OS disk is intentionally kept small and container data is expected to reside on a separate data disk. What Changed in Docker Engine v29 (Linux) Starting with Docker Engine 29.0, containerd’s image store becomes the default storage backend on fresh installations. Docker explicitly documents this behavior: “The containerd image store is the default storage backend for Docker Engine 29.0 and later on fresh installations.” Docker containerd image store documentation Key points on Linux: Docker now delegates image and snapshot storage to containerd containerd uses its own content store and snapshotters Docker’s traditional data-root setting no longer controls all container storage Docker Engine v29 was released on 11 November 2025, and this behavior is by design, not a regression. Where Disk Usage Goes on Linux Docker’s daemon documentation clarifies the split: Legacy storage (pre‑v29 or upgraded installs): All data under /var/lib/docker Docker Engine v29 (containerd image store enabled): Images & snapshots → /var/lib/containerd Other Docker data (volumes, configs, metadata) → /var/lib/docker Crucially: “The data-root option does not affect image and container data stored in /var/lib/containerd when using the containerd image store.” Docker daemon data directory documentation This explains why OS disk usage continues to grow even when data-root is set to a data disk. Why the Old Configuration Worked Before On earlier Docker versions, Docker fully managed image and snapshot storage. Configuring: { "data-root": "/mnt/docker-data" } Was sufficient to redirect all container storage off the OS disk. With Docker Engine v29: containerd owns image and snapshot storage data-root only affects Docker‑managed data OS disk growth after upgrades or reimages is expected behavior This aligns fully with Docker’s documented design changes. Linux Workaround: Redirect containerd Storage To restore the intended behavior on Linux, keeping both Docker and containerd storage on the mounted data disk, containerd’s storage path must also be redirected. A practical workaround is to relocate /var/lib/containerd using a symbolic link. Example (Linux) sudo systemctl stop docker.socket docker containerd || true; sudo mkdir -p /mnt/docker-data /mnt/containerd; sudo rm -rf /var/lib/containerd; sudo ln -s /mnt/containerd /var/lib/containerd; echo "{\"data-root\": \"/mnt/docker-data\"}" | sudo tee /etc/docker/daemon.json; sudo systemctl daemon-reload; sudo systemctl start containerd docker' What This Does Stops Docker and containerd Creates container storage directories on the mounted data disk Redirects /var/lib/containerd → /mnt/containerd Keeps Docker’s data-root at /mnt/docker-data Restarts services with a unified storage layout This workaround is effective because it explicitly accounts for containerd‑managed paths introduced in Docker Engine v29, restoring the behavior that existed prior to the change. Key Takeaways Docker Engine v29 introduces a fundamental storage architecture change on Linux data-root alone is no longer sufficient OS disk growth after upgrades or reimages is expected containerd storage must also be redirected The workaround aligns with Docker’s official documentation and design References Docker daemon data directory https://docs.docker.com/engine/daemon/ containerd image store (Docker Engine v29) https://docs.docker.com/engine/storage/containerd/ Docker Engine v29 release notes https://docs.docker.com/engine/release-notes/29/69Views0likes0CommentsHow Solution Architects Can Use Microsoft’s Azure Globe Experience to Design Smarter Architectures
In today’s cloud-driven landscape, solution architects are expected to make decisions that balance performance, scalability, compliance, cost, and even sustainability. It’s no longer enough to simply choose the right services—you also need to choose the right locations. https://dellenny.com/how-solution-architects-can-use-microsofts-azure-globe-experience-to-design-smarter-cloud-architectures/33Views0likes0CommentsMy First TechCommunity Post: Azure VPN Gateway BGP Timer Mismatches
This is my first post on the Microsoft TechCommunity. Today is my seven-year anniversary at Microsoft. In my current role as a Senior Cloud Solution Architect supporting Infrastructure in Cloud & AI Platforms, I want to start by sharing a real-world lesson learned from customer engagements rather than a purely theoretical walkthrough. This work and the update of the official documentation on Microsoft Learn is the culmination of nearly two years of support for a very large global SD-WAN deployment with hundreds of site-to-site VPN connections into Azure VPN Gateway. The topic is deceptively simple—BGP timers—but mismatched expectations can cause significant instability when connecting on‑premises environments to Azure. If you’ve ever seen seemingly random BGP session resets, intermittent route loss, or confusing failover behavior, there’s a good chance that a timer mismatch between Azure and your customer premises equipment (CPE) was a contributing factor. Customer Expectation: BGP Timer Negotiation Many enterprise routers and firewalls support aggressive BGP timers and expect them to be negotiated during session establishment. A common configuration I see in customer environments looks like: Keepalive: 10 seconds Hold time: 30 seconds This configuration is not inherently wrong. In fact, it is often used intentionally to speed up failure detection and convergence in conventional network environments. My past experience with short timers was in a national cellular network carrier between core switching routers in adjacent racks, but all other connections used the default timer values. The challenge appears when that expectation is carried into Azure VPN Gateway. Azure VPN Gateway Reality: Fixed BGP Timers Azure VPN Gateway supports BGP but uses fixed timers (60/180) and won’t negotiate down. The timers are documented: The BGP keepalive timer is 60 seconds, and the hold timer is 180 seconds. Azure VPN Gateways use fixed timer values and do not support configurable keepalive or hold timers. This behavior is consistent across supported VPN Gateway SKUs that offer BGP support. Unlike some on‑premises devices, Azure will not adapt its timers downward during session establishment. What Happens During a Timer Mismatch When a CPE is configured with a 30‑second hold timer, it expects to receive BGP keepalives well within that window. Azure, however, sends BGP keepalives every 60 seconds. From the CPE’s point of view: No keepalive is received within 30 seconds The BGP hold timer expires The session is declared dead and torn down Azure may not declare the peer down on the same timeline as the CPE. This mismatch leads to repeated session flaps. The Hidden Side Effect: BGP State and Stability Controls During these rapid teardown and re‑establishment cycles, many CPE platforms rebuild their BGP tables and may increment internal routing metadata. When this occurs repeatedly: Azure observes unexpected and rapid route updates The BGP finite state machine is forced to continually reset and re‑converge BGP session stability is compromised CPE equipment logging may trigger alerts and internal support tickets. The resulting behavior is often described by customers as “Azure randomly drops routes” or “BGP is unstable”, when the instability originates from mismatched BGP timer expectations between the CPE and Azure VPN Gateway. Why This Is More Noticeable on VPN (Not ExpressRoute) This issue is far more common with VPN Gateway than with ExpressRoute. ExpressRoute supports BFD and allows faster failure detection without relying solely on aggressive BGP timers. VPN Gateway does not support BFD, so customers sometimes compensate by lowering BGP timers on the CPE—unintentionally creating this mismatch. The VPN path is Internet/WAN-like where delay/loss/jitter is normal, so conservative timer choices are stability-focused. Updated Azure Documentation The good news is that the official Azure documentation has been updated to clearly state the fixed BGP timer values for VPN Gateway: Keepalive: 60 seconds Hold time: 180 seconds Timer negotiation: Azure uses fixed timers Azure VPN Gateway FAQ | Microsoft Learn This clarification helps set the right expectations and prevents customers from assuming Azure behaves like conventional CPE routers. Practical Guidance If you are connecting a CPE to Azure VPN Gateway using BGP: Do not configure BGP timers lower than Azure’s defaults Align CPE timers to 60 / 180 or higher Avoid using aggressive timers as a substitute for BFD For further resilience: Consider Active‑Active VPN Gateways for better resiliency Use 4 Tunnels commonly implemented in a bowtie configuration for even better resiliency and traffic stability Closing Thoughts This is a great example of how cloud networking often behaves correctly, but differently than conventional on‑premises networking environments. Understanding those differences—and documenting them clearly—can save hours of troubleshooting and frustration. If this post helps even one engineer avoid a late‑night or multi-month BGP debugging session, then it has done its job. I did use AI (M365 Copilot) to aid in formatting and to validate technical accuracy. Otherwise, these are my thoughts. Thanks for reading my first TechCommunity post.164Views4likes0CommentsHelp wanted: Refresh articles in Azure Architecture Center (AAC)
I’m the Project Manager for architecture review boards (ARBs) in the Azure Architecture Center (AAC). We’re looking for subject matter experts to help us improve the freshness of the AAC, Cloud Adoption Framework (CAF), and Well-Architected Framework (WAF) repos. This opportunity is currently limited to Microsoft employees only. As an ARB member, your main focus is to review, update, and maintain content to meet quarterly freshness targets. Your involvement directly impacts the quality, relevance, and direction of Azure Patterns & Practices content across AAC, CAF, and WAF. The content in these repos reaches almost 900,000 unique readers per month, so your time investment has a big, global impact. The expected commitment is 4-6 hours per month, including attendance at weekly or bi-weekly sync meetings. Become an ARB member to gain: Increased visibility and credibility as a subject‑matter expert by contributing to Microsoft‑authored guidance used by customers and partners worldwide. Broader internal reach and networking without changing roles or teams. Attribution on Microsoft Learn articles that you own. Opportunity to take on expanded roles over time (for example, owning a set of articles, mentoring contributors, or helping shape ARB direction). We’re recruiting new members across several ARBs. Our highest needs are in the Web ARB, Containers ARB, and Data & Analytics ARB: The Web ARB focuses on modern web application architecture on Azure—App Service and PaaS web apps, APIs and API Management, ingress and networking (Application Gateway, Front Door, DNS), security and identity, and designing for reliability, scalability, and disaster recovery. The Containers ARB focuses on containerized and Kubernetes‑based architectures—AKS design and operations, networking and ingress, security and identity, scalability, and reliability for production container platforms. The Data & Analytics ARB focuses on data platform and analytics architectures—data ingestion and integration, analytics and reporting, streaming and real‑time scenarios, data security and governance, and designing scalable, reliable data solutions on Azure. We’re also looking for people to take ownership of other articles across AAC, CAF, and WAF. These articles span many areas, including application and solution architectures, containers and compute, networking and security, governance and observability, data and integration, and reliability and operational best practices. You don’t need to know everything—deep expertise in one or two areas and an interest in keeping Azure architecture guidance accurate and current is what matters most. Please reply to this post if you’re interested in becoming an ARB member, and I’ll follow up with next steps. If you prefer, you can email me at v-jodimartis@microsoft.com. Thanks! 🙂Help wanted: Refresh articles in Azure Architecture Center (AAC)
I’m the Project Manager for architecture review boards (ARBs) in the Azure Architecture Center (AAC). We’re looking for subject matter experts to help us improve the freshness of the AAC, Cloud Adoption Framework (CAF), and Well-Architected Framework (WAF) repos. This opportunity is currently limited to Microsoft employees only. As an ARB member, your main focus is to review, update, and maintain content to meet quarterly freshness targets. Your involvement directly impacts the quality, relevance, and direction of Azure Patterns & Practices content across AAC, CAF, and WAF. The content in these repos reaches almost 900,000 unique readers per month, so your time investment has a big, global impact. The expected commitment is 4-6 hours per month, including attendance at weekly or bi-weekly sync meetings. Become an ARB member to gain: Increased visibility and credibility as a subject‑matter expert by contributing to Microsoft‑authored guidance used by customers and partners worldwide. Broader internal reach and networking without changing roles or teams. Attribution on Microsoft Learn articles that you own. Opportunity to take on expanded roles over time (for example, owning a set of articles, mentoring contributors, or helping shape ARB direction). We’re recruiting new members across several ARBs. Our highest needs are in the Web ARB, Containers ARB, and Data & Analytics ARB: The Web ARB focuses on modern web application architecture on Azure—App Service and PaaS web apps, APIs and API Management, ingress and networking (Application Gateway, Front Door, DNS), security and identity, and designing for reliability, scalability, and disaster recovery. The Containers ARB focuses on containerized and Kubernetes‑based architectures—AKS design and operations, networking and ingress, security and identity, scalability, and reliability for production container platforms. The Data & Analytics ARB focuses on data platform and analytics architectures—data ingestion and integration, analytics and reporting, streaming and real‑time scenarios, data security and governance, and designing scalable, reliable data solutions on Azure. We’re also looking for people to take ownership of other articles across AAC, CAF, and WAF. These articles span many areas, including application and solution architectures, containers and compute, networking and security, governance and observability, data and integration, and reliability and operational best practices. You don’t need to know everything—deep expertise in one or two areas and an interest in keeping Azure architecture guidance accurate and current is what matters most. Please reply to this post if you’re interested in becoming an ARB member, and I’ll follow up with next steps. If you prefer, you can email me at v-jodimartis@microsoft.com. Thanks! 🙂37Views0likes0CommentsHi everyone!
Hi everyone! 👋 I’m new to this community and currently learning Azure Analytics. I’m really excited to be here and connect with people who have experience in this field. I believe the discussions and knowledge shared by members here are very valuable, and I’m looking forward to learning from all of you. If you have any advice, resources, or tips for someone starting with Azure Analytics, I’d really appreciate it. Happy to be part of this community! 😊40Views2likes0CommentsAgentic AI in IT: Self-Healing Systems and Smart Incident Response (Microsoft Ecosystem Perspective)
Modern IT infrastructures are evolving rapidly. Organizations now run workloads across hybrid cloud environments, microservices architectures, Kubernetes clusters, and distributed applications. Managing this complexity with traditional monitoring tools is becoming increasingly difficult. https://dellenny.com/agentic-ai-in-it-self-healing-systems-and-smart-incident-response-microsoft-ecosystem-perspective/56Views0likes0CommentsImproper AVD Host Decommissioning – A Practical Governance Framework
Hi everyone, After working with multiple production Azure Virtual Desktop environments, I noticed a recurring issue that rarely gets documented properly: Improper host decommissioning. Scaling out AVD is easy. Scaling down safely is where environments silently drift. Common issues I’ve seen in the field: Session hosts deleted before drain completion Orphaned Entra ID device objects Intune-managed device records left behind Stale registration tokens FSLogix containers remaining locked Defender onboarding objects not cleaned Host pool inconsistencies over time The problem is not technical complexity. It’s lifecycle governance. So I built a structured approach to host decommissioning focused on: Drain validation Active session verification Controlled removal from host pool VM deletion sequencing Identity cleanup validation Registration token rotation Logging and execution safety I’ve published a practical framework here: The framework is fully documented and includes validation logic and logging. https://github.com/modernendpoint/AVD-Host-Decommission-Framework The goal is simple: Not just removing a VM — but preserving platform integrity. I’m curious: How are you handling host lifecycle management in your AVD environments? Fully automated? Manual? Integrated with scaling plans? Identity cleanup included? Would love to hear how others approach this. Menahem Suissa AVD | Intune | Identity-Driven Architecture164Views0likes0CommentsAdmin‑On‑Behalf‑Of issue when purchasing subscription
Hello everyone! I want to reach out to you on the internet and ask if anyone has the same issue as we do when creating PAYG Azure subscriptions in a customer's tenant, in which we have delegated access via GDAP through PartnerCenter. It is a bit AI formatted question. When an Azure NCE subscription is created for a customer via an Indirect Provider portal, the CSP Admin Agent (foreign principal) is not automatically assigned Owner on the subscription. As a result: AOBO (Admin‑On‑Behalf‑Of) does not activate The subscription is invisible to the partner when accessing Azure via Partner Center service links The partner cannot manage and deploy to a subscription they just provided This breaks the expected delegated administration flow. Expected Behavior For CSP‑created Azure subscriptions: The CSP Admin Agent group should automatically receive Owner (or equivalent) on the subscription AOBO should work immediately, without customer involvement The partner should be able to see the subscription in Azure Portal and deploy resources Actual Behavior Observed For Azure NCE subscriptions created via an Indirect Provider: No RBAC assignment is created for the foreign AdminAgent group The subscription is visible only to users inside the customer tenant Partner Center role (Admin Agent foreign group) is present, but without Azure RBAC. Required Customer Workaround For each new Azure NCE subscription, the customer must: Sign in as Global Admin Use “Elevate access to manage all Azure subscriptions and management groups” Assign themselves Owner on the subscription Manually assign Owner to the partner’s foreign AdminAgent group Only after this does AOBO start working. Example Partner tries to access the subscription: https://portal.azure.com/#@customer.onmicrosoft.com/resource/subscriptions/<subscription-id>/overview But there is no subscription visible "None of the entries matched the given filter" https://learn.microsoft.com/en-us/azure/role-based-access-control/elevate-access-global-admin?tabs=azure-portal%2Centra-audit-logs#step-1-elevate-access-for-a-global-administrator from the customer's global admin. and manual RBAC fix in Cloud console: az role assignment create \ --assignee-object-id "<AdminAgent-Foreign-Group-ObjectId>" \ --role "Owner" \ --scope "/subscriptions/<subscription-id>" \ --assignee-principal-type "ForeignGroup" After this, AOBO works as expected for delegated administrators (foreign user accounts). Why This Is a Problem Partners sell Azure subscriptions that they cannot access Forces resources from customers to involvement from customers Breaks delegated administration principles For Indirect CSPs managing many tenants, this is a decent operational blocker. Key Question to Microsoft / Community Does anyone else struggle with this? Is this behavior by design for Azure NCE + Indirect CSP? Am I missing some point of view on why not to do it in the suggested way?70Views0likes0CommentsAzure’s Default Outbound Access Changes: Guidance for Azure Virtual Desktop Customers
After March 31, 2026, newly created Azure Virtual Networks (VNets) will no longer have default outbound internet access (DOA) enabled by default. Azure Virtual Desktop customers must configure outbound connectivity explicitly when setting up new VNets. This post explains what’s changing, who’s impacted, and the recommended actions, including Private Subnets. What is Default Outbound Access (DOA)? Default Outbound Access is Azure’s legacy behavior that allowed all resources in a virtual network to reach the public internet without configuring a specific internet egress path. This allowed telemetry, Windows activation, updates, and other service dependencies to reach external endpoints even when no explicit outbound connectivity method was configured. What’s changing? After March 31, 2026, as detailed in Azure’s communications, Azure will no longer enable DOA by default for new virtual networks. Instead, the VNet will be configured for Private Subnet option, allowing you to designate subnets without internet access for improved isolation and compliance. These changes encourage more intentional, secure network configurations while offering flexibility for different workload needs. Disabling Private Subnet option will allow administrators to restore DOA capabilities to the VNet, although Microsoft strongly recommends using NAT Gateway to provide outbound Internet access for session hosts. Impact on Azure Virtual Desktop Customers For Azure Virtual Desktop deployments created after March 31, 2026, outbound internet access must be explicitly configured, otherwise deployment and connectivity of the Session Hosts will fail. Existing VNets remain unaffected and will continue to use the configured internet access method. What You Should Do To prepare for Azure’s Default Outbound Access changes and ensure your Azure Virtual Desktop deployments remain secure and functional. Recommendations Update deployment plans to ensure either an explicit NAT, such as a NAT Gateway or Default Outbound access (not recommended) is enabled by disabling the Private Subnet option. Test connectivity to ensure all services dependent on outbound access continue to function as expected. Supported Outbound Access Methods To maintain connectivity, choose one of these supported methods: NAT Gateway (recommended) Note: Direct RDP Shortpath (UDP over STUN) cannot be established through a NAT Gateway because its symmetric NAT policy prevents direct UDP connectivity over public networks. Azure Standard Load Balancer Public IP address on a VM Azure Firewall or third-party Network Virtual Appliance (NVA). Note, it is not recommended to route RDP or other long-lived connections through Azure Firewall or any other network virtual appliance which allows for automatic scale-in. A direct method such as NAT Gateway should be used. More information about the pros and cons of each method can be found at Default Outbound Access. Resources: Azure updates | Microsoft Azure Default Outbound Access in Azure Transition to an explicit method of public connectivity| Microsoft Learn Quickstart: Create a NAT Gateway Quick FAQ Does this affect existing VNets? No. Only VNets created after March 31, 2026, are affected. Existing VNets will continue to operate as normal. What if I do nothing on a new VNet? Host pool deployment will fail, and connectivity will fail because the VNet does not have internet access. Configure NAT Gateway or another supported method before starting a host pool deployment. Why do Azure Virtual Desktop session hosts need outbound internet access? Many Azure Virtual Desktop functions depend on the session host having outbound access to Microsoft services. Without configuring NAT Gateway or another supported method of explicit outbound for the VNet, Azure Virtual Desktop will not deploy or function correctly. What are the required endpoints? Please see https://learn.microsoft.com/azure/virtual-desktop/required-fqdn-endpoint?tabs=azure for a list of the endpoints required. Why might peer-to-peer connectivity using STUN-based UDP hole punching not work when using NAT Gateway? NAT Gateway uses a type of network address translation that does not support cone symmetric NAT behavior. This can prevent STUN (Simple Traversal Underneath NAT) based UDP hole punching, commonly used for establishing peer-to-peer connections, from working as expected. If your application relies on reliable UDP connectivity between peers, STUN may revert to TURN (Traversal Using Relays around NAT) in some instances. TURN relays traffic between endpoints, ensuring consistent connectivity even when direct peer-to-peer paths are blocked. This helps maintain smooth real-time experiences for your users. What explicit outbound options support STUN? Azure Standard Load Balancer supports UDP over STUN. How do I configure Azure Firewall? For additional security you can configure Azure Firewall using these instructions: https://learn.microsoft.com/en-us/azure/firewall/protect-azure-virtual-desktop?context=/azure/virtual-desktop/context/context . It is strongly recommended that a direct method of access is used for RDP and other long-lived connections such as VPN or Secure Web Gateway tunnels. This is due to devices such as Azure firewall scaling in when load is low which can disrupt connectivity. Wrap-up Azure’s change reinforces intentional networking for better security. By planning explicit egress, Azure Virtual Desktop customers can stay compliant and keep session hosts reliably connected.1.5KViews1like0CommentsHow to Fix Azure Event Grid Entra Authentication issue for ACS and Dynamics 365 integrated Webhooks
Introduction: Azure Event Grid is a powerful event routing service that enables event-driven architectures in Azure. When delivering events to webhook endpoints, security becomes paramount. Microsoft provides a secure webhook delivery mechanism using Microsoft Entra ID (formerly Azure Active Directory) authentication through the AzureEventGridSecureWebhookSubscriber role. Problem Statement: When integrating Azure Communication Services with Dynamics 365 Contact Center using Microsoft Entra ID-authenticated Event Grid webhooks, the Event Grid subscription deployment fails with an error: "HTTP POST request failed with unknown error code" with empty HTTP status and code. For example: Important Note: Before moving forward, please verify that you have the Owner role assigned on app to create event subscription. Refer to the Microsoft guidelines below to validate the required prerequisites before proceeding: Set up incoming calls, call recording, and SMS services | Microsoft Learn Why This Happens: This happens because AzureEventGridSecureWebhookSubscriber role is NOT properly configured on Microsoft EventGrid SP (Service Principal) and event subscription entra ID or application who is trying to create event grid subscription. What is AzureEventGridSecureWebhookSubscriber Role: The AzureEventGridSecureWebhookSubscriber is an Azure Entra application role that: Enables your application to verify the identity of event senders Allows specific users/applications to create event subscriptions Authorizes Event Grid to deliver events to your webhook How It Works: Role Creation: You create this app role in your destination webhook application's Azure Entra registration Role Assignment: You assign this role to: Microsoft Event Grid service principal (so it can deliver events) Either Entra ID / Entra User or Event subscription creator applications (so they can create event grid subscriptions) Token Validation: When Event Grid delivers events, it includes an Azure Entra token with this role claim Authorization Check: Your webhook validates the token and checks for the role Key Participants: Webhook Application (Your App) Purpose: Receives and processes events App Registration: Created in Azure Entra Contains: The AzureEventGridSecureWebhookSubscriber app role Validates: Incoming tokens from Event Grid Microsoft Event Grid Service Principal Purpose: Delivers events to webhooks App ID: Different per Azure cloud (Public, Government, etc.) Public Azure: 4962773b-9cdb-44cf-a8bf-237846a00ab7 Needs: AzureEventGridSecureWebhookSubscriber role assigned Event Subscription Creator Entra or Application Purpose: Creates event subscriptions Could be: You, Your deployment pipeline, admin tool, or another application Needs: AzureEventGridSecureWebhookSubscriber role assigned Although the full PowerShell script is documented in the below Event Grid documentation, it may be complex to interpret and troubleshoot. Azure PowerShell - Secure WebHook delivery with Microsoft Entra Application in Azure Event Grid - Azure Event Grid | Microsoft Learn To improve accessibility, the following section provides a simplified step-by-step tested solution along with verification steps suitable for all users including non-technical: Steps: STEP 1: Verify/Create Microsoft.EventGrid Service Principal Azure Portal → Microsoft Entra ID → Enterprise applications Change filter to Application type: Microsoft Applications Search for: Microsoft.EventGrid Ideally, your Azure subscription should include this application ID, which is common across all Azure subscriptions: 4962773b-9cdb-44cf-a8bf-237846a00ab7. If this application ID is not present, please contact your Azure Cloud Administrator. STEP 2: Create the App Role "AzureEventGridSecureWebhookSubscriber" Using Azure Portal: Navigate to your Webhook App Registration: Azure Portal → Microsoft Entra ID → App registrations Click All applications Find your app by searching OR use the Object ID you have Click on your app Create the App Role: Display name: AzureEventGridSecureWebhookSubscriber Allowed member types: Both (Users/Groups + Applications) Value: AzureEventGridSecureWebhookSubscriber Description: Azure Event Grid Role Do you want to enable this app role?: Yes In left menu, click App roles Click + Create app role Fill in the form: Click Apply STEP 3: Assign YOUR USER to the Role Using Azure Portal: Switch to Enterprise Application view: Azure Portal → Microsoft Entra ID → Enterprise applications Search for your webhook app (by name) Click on it Assign yourself: In left menu, click Users and groups Click + Add user/group Under Users, click None Selected Search for your user account (use your email) Select yourself Click Select Under Select a role, click None Selected Select AzureEventGridSecureWebhookSubscriber Click Select Click Assign STEP 4: Assign Microsoft.EventGrid Service Principal to the Role This step MUST be done via PowerShell or Azure CLI (Portal doesn't support this directly as we have seen) so PowerShell is recommended You will need to execute this step with the help of your Entra admin. # Connect to Microsoft Graph Connect-MgGraph -Scopes "AppRoleAssignment.ReadWrite.All" # Replace this with your webhook app's Application (client) ID $webhookAppId = "YOUR-WEBHOOK-APP-ID-HERE" #starting with c5 # Get your webhook app's service principal $webhookSP = Get-MgServicePrincipal -Filter "appId eq '$webhookAppId'" Write-Host " Found webhook app: $($webhookSP.DisplayName)" # Get Event Grid service principal $eventGridSP = Get-MgServicePrincipal -Filter "appId eq '4962773b-9cdb-44cf-a8bf-237846a00ab7'" Write-Host " Found Event Grid service principal" # Get the app role $appRole = $webhookSP.AppRoles | Where-Object {$_.Value -eq "AzureEventGridSecureWebhookSubscriber"} Write-Host " Found app role: $($appRole.DisplayName)" # Create the assignment New-MgServicePrincipalAppRoleAssignment ` -ServicePrincipalId $eventGridSP.Id ` -PrincipalId $eventGridSP.Id ` -ResourceId $webhookSP.Id ` -AppRoleId $appRole.Id Write-Host "Successfully assigned Event Grid to your webhook app!" Verification Steps: Verify the App Role was created: Your App Registration → App roles You should see: AzureEventGridSecureWebhookSubscriber Verify your user assignment: Enterprise application (your webhook app) → Users and groups You should see your user with role AzureEventGridSecureWebhookSubscriber Verify Event Grid assignment: Same location → Users and groups You should see Microsoft.EventGrid with role AzureEventGridSecureWebhookSubscriber Sample Flow: Analogy For Simplification: Lets think it similar to the construction site bulding where you are the owner of the building. Building = Azure Entra app (webhook app) Building (Azure Entra App Registration for Webhook) ├─ Building Name: "MyWebhook-App" ├─ Building Address: Application ID ├─ Building Owner: You ├─ Security System: App Roles (the security badges you create) └─ Security Team: Azure Entra and your actual webhook auth code (which validates tokens) like doorman Step 1: Creat the badge (App role) You (the building owner) create a special badge: - Badge name: "AzureEventGridSecureWebhookSubscriber" - Badge color: Let's say it's GOLD - Who can have it: Companies (Applications) and People (Users) This badge is stored in your building's system (Webhook App Registration) Step 2: Give badge to the Event Grid Service: Event Grid: "Hey, I need to deliver messages to your building" You: "Okay, here's a GOLD badge for your SP" Event Grid: *wears the badge* Now Event Grid can: - Show the badge to Azure Entra - Get tokens that say "I have the GOLD badge" - Deliver messages to your webhook Step 3: Give badge to yourself (or your deployment tool) You also need a GOLD badge because: - You want to create event grid event subscriptions - Entra checks: "Does this person have a GOLD badge?" - If yes: You can create subscriptions - If no: "Access denied" Your deployment pipeline also gets a GOLD badge: - So it can automatically set up event subscriptions during CI/CD deployments Disclaimer: The sample scripts provided in this article are provided AS IS without warranty of any kind. The author is not responsible for any issues, damages, or problems that may arise from using these scripts. Users should thoroughly test any implementation in their environment before deploying to production. Azure services and APIs may change over time, which could affect the functionality of the provided scripts. Always refer to the latest Azure documentation for the most up-to-date information. Thanks for reading this blog! I hope you found it helpful and informative for this specific integration use case 😀268Views3likes0CommentsIntegrate Agents with Skills in Github Copilot
The past year saw the rise of Agentic workflows. Agents have a task or goal to accomplish and build context, take actions using tools. Tools while affective in surfacing the requisite sources and actions can easily increase in numbers causing context bloat, high token consumption. Agent Skills was proposed in a recent Anthropic paper to address the above challenges. Agent Skills are now supported in Visual Studio Code (Experimental) and can be used with Github Copilot. It works across Copilot coding agent, Copilot CLI, and agent mode in Visual Studio Code Insiders. Copilot coding agent is available with the GitHub Copilot Pro, GitHub Copilot Pro+, GitHub Copilot Business and GitHub Copilot Enterprise plans. The agent is available in all repositories stored on GitHub, except repositories owned by managed user accounts and where it has been explicitly disabled. An Agent Skill is created to teach Copilot on performing specialized tasks with detailed instructions while also being repeatable. At its core, Agent Skills are folders which contain instructions, scripts, and resources that the Copilot automatically loads when relevant to the query. On receiving a prompt, Copilot determines if a skill is relevant to your task and it then loads the instructions. The skills instructions are executed along with any resources included in the directory structure relevant to the specific skill. One guideline would be to encapsulate into a skill anything which is being done repeatedly. In the example below, we have a skill for creating a github issue for a feature request using a specific template (the template will be referenced by the skill based on the type of issue to be created). The SKILL.md file is very detailed in all the instructions required for supporting multiple github issues related actions. The description is key to understanding the Skill and when the Agent requires a specific Skill, the appropriate instructions are loaded. The loaded Skill is then executed in a secure code execution environment. A further option provided by Agent Skills is reusing the generated code by storing it in the filesystem to avoid repeated execution. In Visual Studio Code, enable the "chat.useAgentSkills" setting to use Agent Skills prior to the run. An Agent can have nested agents which is used to detail sub agents (Nested Agents is also enabled in settings as shown below) and thus decouple functionality. Any prompt in the chat will now have the option to pick from the Agent Skills in addition to the tools available. We can write our own skills, or use those which are shared by others - anthropics/skills repository or GitHub’s community created github/awesome-copilot collection. While skills are very powerful, using shared skills needs to be done with discretion and from a security perspective only use skills shared by trusted sources. Resources https://github.blog/changelog/2025-12-18-github-copilot-now-supports-agent-skills/ https://code.visualstudio.com/docs/copilot/customization/agent-skills789Views0likes0CommentsAzure Migrate Physical Server Discovery - ServerDiscoveryService.exe Crash Bug
Summary The Azure Migrate appliance for physical server discovery fails to complete discovery due to a crash bug in ServerDiscoveryService.exe. The service successfully connects to target servers but crashes during WSMan transport cleanup before any discovery data is collected. Environment Appliance OS: Windows Server 2022 Standard Evaluation (Build 20348) Appliance Type: Physical server discovery (script-based installation) ServerDiscoveryService.exe Version: 2.0.3300.663 .NET Version: 8.0.22 (CoreCLR 8.0.2225.52707) Target Servers: Windows Server (various) and Linux, all on-premises Discovery Agent Version: 2.0.03300.663 Appliance Configuration Manager Version: 6.1.294.1847 Symptoms Target server validation succeeds in the appliance configuration manager CIM sessions connect successfully (logs show "TestConnection succeeded for CIM Session with HTTP protocol") Connections are immediately disposed with "Disposing all connections when the process is shutdown" No discovery data is collected Azure portal shows error 60001 with misleading "Could not load file or assembly 'Microsoft.Management.Infrastructure'" message Discovery status remains "Discovery Incomplete" for all Windows servers Root Cause The ServerDiscoveryService.exe process crashes repeatedly with an unhandled NullReferenceException in the WSMan transport finalizer. This is visible in the Windows Application Event Log: Application: ServerDiscoveryService.exe CoreCLR Version: 8.0.2225.52707 .NET Version: 8.0.22 Description: The process was terminated due to an unhandled exception. Exception Info: System.NullReferenceException: Object reference not set to an instance of an object. at System.Management.Automation.Remoting.Client.BaseClientTransportManager.CloseAsync() at System.Management.Automation.Remoting.Client.WSManClientSessionTransportManager.CloseAsync() at System.Management.Automation.Remoting.Client.BaseClientTransportManager.Finalize() The crash also triggers an access violation: Faulting application name: ServerDiscoveryService.exe, version: 2.0.3300.663 Exception code: 0xc0000005 Faulting application path: C:\Program Files\Microsoft Azure Server Discovery Service\ServerDiscoveryService.exe These crashes occur approximately every 10 minutes. Troubleshooting Completed Verified manual connectivity works: PowerShell Invoke-Command and New-CimSession both succeed from the appliance to target servers using the same credentials Verified WinRM configuration: Targets have WinRM HTTP listener on port 5985, LocalAccountTokenFilterPolicy is set to 1 Verified assemblies exist: Microsoft.Management.Infrastructure.dll is present in the GAC on both the appliance and target servers Tested both FQDNs and IP addresses: Same failure occurs with both Tested both local and domain credentials: Same failure with properly formatted credentials (domain\user) Verified time synchronization: Appliance clock is accurate Verified appliance is up to date: All components show current versions Tested with fresh appliance: Previously tried OVA-based appliance with similar results; rebuilt using Microsoft's PowerShell script installer on clean Server 2022—same issue Relevant Log Locations C:\ProgramData\Microsoft Azure\Logs\ConfigManager\ClientOperations_*.log - Shows successful CIM connections followed by immediate disposal C:\ProgramData\Microsoft Azure\Logs\ConfigManager\ApplianceOnboarding-Portal-*.log - Shows error 60000 "UnhandledException" with message "Internal error occured." (note: typo is in original) Windows Event Log (Application) - Contains the actual crash stack traces Conclusion This is a code defect in ServerDiscoveryService.exe—a null reference exception in a finalizer is a programming error that cannot be caused by configuration or environmental factors. The service connects successfully but crashes before completing its work. Request Please escalate to the Azure Migrate engineering team for a bug fix in ServerDiscoveryService.exe version 2.0.3300.663.91Views0likes0CommentsAdvanced Container Apps Networking: VNet Integration and Centralized Firewall Traffic Logging
Azure community, I recently documented a networking scenario relevant to Azure Container Apps environments where you need to control and inspect application traffic using a third-party network virtual appliance. The article walks through a practical deployment pattern: • Integrate your Azure Container Apps environment with a Virtual Network. • Configure user-defined routes (UDRs) so that traffic from your container workloads is directed toward a firewall appliance before reaching external networks or backend services. • Verify actual traffic paths using firewall logs to confirm that routing policies are effective. This pattern is helpful for organizations that must enforce advanced filtering, logging, or compliance checks on container egress/ingress traffic, going beyond what native Azure networking controls provide. It also complements Azure Firewall and NSG controls by introducing a dedicated next-generation firewall within your VNet. If you’re working with network control, security perimeters, or hybrid network architectures involving containerized workloads on Azure, you might find it useful. Read the full article on my blog91Views0likes0CommentsApplying DevOps Principles on Lean Infrastructure. Lessons From Scaling to 102K Users.
Hi Azure Community, I'm a Microsoft Certified DevOps Engineer, and I want to share an unusual journey. I have been applying DevOps principles on traditional VPS infrastructure to scale to 102,000 users with 99.2% uptime. Why am I posting this in an Azure community? Because I'm planning migration to Azure in 2026, and I want to understand: What mistakes am I already making that will bite me during migration? THE CURRENT SETUP Platform: Social commerce (West Africa) Users: 102,000 active Monthly events: 2 million Uptime: 99.2% Infrastructure: Single VPS Stack: PHP/Laravel, MySQL, Redis Yes - one VPS. No cloud. No Kubernetes. No microservices. WHY I HAVEN'T USED AZURE YET Honest answer: Budget constraints in emerging market startup ecosystem. At our current scale, fully managed Azure services would significantly increase monthly burn before product-market expansion. The funding we raised needs to last through growth milestones. The trade: I manually optimize what Azure would auto-scale. I debug what Application Insights would catch. I do by hand what Azure Functions would automate. DEVOPS PRACTICES THAT KEPT US RUNNING Even on single-server infrastructure, core DevOps principles still apply: CI/CD Pipeline (GitHub Actions) • 3-5 deployments weekly • Zero-downtime deploys • Automated rollback on health check failures • Feature flags for gradual rollouts Monitoring & Observability • Custom monitoring (would love Application Insights) • Real-time alerting • Performance tracking and slow query detection • Resource usage monitoring Automation • Automated backups • Automated database optimization • Automated image compression • Automated security updates Infrastructure as Code • Configs in Git • Deployment scripts • Environment variables • Documented procedures Testing & Quality • Automated test suite • Pre-deployment health checks • Staging environment • Post-deployment verification KEY OPTIMIZATIONS Async Job Processing • Upload endpoint: 8 seconds → 340ms • 4x capacity increase Database Optimization • Feed loading: 6.4 seconds → 280ms • Strategic caching • Batch processing Image Compression • 3-8MB → 180KB (94% reduction) • Critical for mobile users Caching Strategy • Redis for hot data • Query result caching • Smart invalidation Progressive Enhancement • Server-rendered pages • 2-3 second loads on 4G WHAT I'M WORRIED ABOUT FOR AZURE MIGRATION This is where I need your help: Architecture Decisions • App Service vs Functions + managed services? • MySQL vs Azure SQL? • When does cost/benefit flip for managed services? Cost Management • How do startups manage Azure costs during growth? • Reserved instances vs pay-as-you-go? • Which Azure services are worth the premium? Migration Strategy • Lift-and-shift first, or re-architect immediately? • Zero-downtime migration with 102K active users? • Validation approach before full cutover? Monitoring & DevOps • Application Insights - worth it from day one? • Azure DevOps vs GitHub Actions for Azure deployments? • Operational burden reduction with managed services? Development Workflow • Local development against Azure services? • Cost-effective staging environments? • Testing Azure features without constant bills? MY PLANNED MIGRATION PATH Phase 1: Hybrid (Q1 2026) • Azure CDN for static assets • Azure Blob Storage for images • Application Insights trial • Keep compute on VPS Phase 2: Compute Migration (Q2 2026) • App Service for API • Azure Database for MySQL • Azure Cache for Redis • VPS for background jobs Phase 3: Full Azure (Q3 2026) • Azure Functions for processing • Full managed services • Retire VPS QUESTIONS FOR THIS COMMUNITY Question 1: Am I making migration harder by waiting? Should I have started with Azure at higher cost to avoid technical debt? Question 2: What will break when I migrate? What works on VPS but fails in cloud? What assumptions won't hold? Question 3: How do I validate before cutting over? Parallel infrastructure? Gradual traffic shift? Safe patterns? Question 4: Cost optimization from day one? What to optimize immediately vs later? Common cost mistakes? Question 5: DevOps practices that transfer? What stays the same? What needs rethinking for cloud-native? THE BIGGER QUESTION Have you migrated from self-hosted to Azure? What surprised you? I know my setup isn't best practice by Azure standards. But it's working, and I've learned optimization, monitoring, and DevOps fundamentals in practice. Will those lessons transfer? Or am I building habits that cloud will expose as problematic? Looking forward to insights from folks who've made similar migrations. --- About the Author: Microsoft Certified DevOps Engineer and Azure Developer. CTO at social commerce platform scaling in West Africa. Preparing for phased Azure migration in 2026. P.S. I got the Azure certifications to prepare for this migration. Now I need real-world wisdom from people who've actually done it!113Views0likes0Comments
Events
If your organization has an Azure cloud commitment, Microsoft Marketplace can be a powerful tool for optimizing how that spend is used. Tune in to explore how your organization can leverage its Azure...
Wednesday, Apr 29, 2026, 08:30 AM PDTOnline
0likes
4Attendees
0Comments
Recent Blogs
- Customers managing large fleets of Azure Arc servers need a scalable way to ensure the Azure Arc agent stays up to date without manual intervention. Per server configuration does not scale, and gaps ...Apr 03, 202648Views0likes0Comments
- 12 MIN READReference architecture and runbook (Part 1: HTTP-only) for Hub-Spoke networking with private Application Gateway (AGIC), Azure Firewall DNAT, and Azure Front Door Premium (WAF) 0. When and Why ...Apr 03, 202676Views1like0Comments