azure
2384 TopicsExcited to share my latest open-source project: KubeCost Guardian
After seeing how many DevOps teams struggle with Kubernetes cost visibility on Azure, I built a full-stack cost optimization platform from scratch. ๐ช๐ต๐ฎ๐ ๐ถ๐ ๐ฑ๐ผ๐ฒ๐: โ Real-time AKS cluster monitoring via Azure SDK โ Cost breakdown per namespace, node, and pod โ AI-powered recommendations generated from actual cluster state โ One-click optimization actions โ JWT-secured dashboard with full REST API ๐ง๐ฒ๐ฐ๐ต ๐ฆ๐๐ฎ๐ฐ๐ธ: - React 18 + TypeScript + Vite - Tailwind CSS + shadcn/ui + Recharts - Node.js + Express + TypeScript - Azure SDK (@azure/arm-containerservice) - JWT Authentication + Azure Service Principal ๐ช๐ต๐ฎ๐ ๐บ๐ฎ๐ธ๐ฒ๐ ๐ถ๐ ๐ฑ๐ถ๐ณ๐ณ๐ฒ๐ฟ๐ฒ๐ป๐: Most cost tools show you generic estimates. KubeCost Guardian reads your actual VM size, node count, and cluster configuration to generate recommendations that are specific to your infrastructure not averages. For example, if your cluster has only 2 nodes with no autoscaler enabled, it immediately flags the HA risk and calculates exactly how much you'd save by switching to Spot instances based on your actual VM size. This project is fully open-source and built for the DevOps community. โญ GitHub: https://github.com/HlaliMedAmine/kubecost-guardian This project represents hours of hard work, and passion. I decided to make it open-source so everyone can benefit from it ๐ค ,If you find it useful, Iโd really appreciate your support . Your support motivates me to keep building and sharing more powerful projects ๐. More exciting ideas are coming soonโฆ stay tuned! ๐ฅ.24Views0likes0CommentsBuilding a Production-Ready Azure Lighthouse Deployment Pipeline with EPAC
Recently I worked on an interesting project for an end-to-end Azure Lighthouse implementation. What really stood out to me was the combination of Azure Lighthouse, EPAC, DevOps, and workload identity federation. The deployment model was so compelling that I decided to build and validate the full solution hands-on in my own personal Azure tenants. The result is a detailed article that documents the entire journey, including pipeline design, implementation steps, and the scripts I prepared along the way. You can read the full article here44Views0likes1CommentPipeline Intelligence is live and open-source real-time Azure DevOps monitoring powered by AI .
Every DevOps team I've worked with had the same problem: Slow pipelines. Zero visibility. No idea where to start. So I stopped complaining and built the solution. So I built something about it. โก Pipeline Intelligence is a full-stack Azure DevOps monitoring dashboard that: โ Connects to your real Azure DevOps organization via REST API โ Detects bottlenecks across all your pipelines automatically โ Calculates exactly how much time your team is wasting per month โ Uses Gemini AI to generate prioritized fixes with ready-to-paste YAML solutions โ JWT-secured, Docker-ready, and fully open-source Tech Stack: โ React 18 + Vite + Tailwind CSS โ Node.js + Express + Azure DevOps API v7 โ Google Gemini 1.5 Flash โ JWT Authentication + Docker ๐ช๐ต๐ฎ๐ ๐บ๐ฎ๐ธ๐ฒ๐ ๐ถ๐ ๐ฑ๐ถ๐ณ๐ณ๐ฒ๐ฟ๐ฒ๐ป๐? Most tools show you generic estimates. Pipeline Intelligence reads your actual cluster config, node count, and pipeline structure and gives you recommendations specific to your infrastructure. ๐ฏ This year, I set myself a personal challenge: Build and open-source a series of production-grade tools exclusively focused on Azure services tools that solve real problems for real DevOps teams. This project represents weeks of research, architecture decisions, and late-night debugging sessions. I'm sharing it with the community because I believe great tooling should be accessible to everyone not locked behind enterprise paywalls. If this resonates with you, I have one simple ask: ๐ A like, a comment, or a share takes 3 seconds but it helps this reach the DevOps engineers who need it most. Your support is what keeps me building. โค๏ธ GitHub: https://github.com/HlaliMedAmine/pipeline-intelligence28Views0likes0CommentsAzure support team not responding to support request
I am posting here because I have not received a response to my support request despite my plan stating that I should hear back within 8 hours. It has now gone a day beyond that limit, and I am still waiting for assistance with this urgent matter. This issue is critical for my operations, and the delay is unacceptable. The ticket/reference number for my original support request was 2410100040000309. And I have created a brand new service request with ID 2412160040010160. I need this addressed immediately.803Views1like8CommentsHow to Use the Azure Pricing Calculator Effectively โ A Step-byโStep Guide
When youโre planning to move workloads to Microsoft Azure, one of the first questions that comes up is simple but important: How much is this going to cost? Cloud pricing can be tricky. Between different regions, service tiers, storage options, and licensing models, itโs easy to underestimate or overestimate costs. Thankfully, Microsoft provides a free tool called the Azure Pricing Calculator to help you get a clear, customized cost estimate before you deploy anything. In this guide, weโll walk through how to use the calculator effectively, the best practices for accurate estimates, and a few tips that can help you plan your Azure budget with confidence. https://dellenny.com/how-to-use-the-azure-pricing-calculator-effectively-a-step-by%e2%80%90step-guide/171Views0likes1CommentAzure Key Vault Replication: Why Paired Regions Alone Donโt Guarantee Business Continuity
As customers modernize toward multiโregion architectures in Azure, one question comes up repeatedly: โIf my region goes down, will Azure Key Vault continue to work without disruption?โ The short answer: it depends on what you mean by โwork.โ Azure Key Vault provides strong durability and availability guarantees, but those guarantees are often misunderstoodโespecially when customers assume pairedโregion replication equals full disaster recovery. In reality, Azure Key Vault replication is designed for survivability, not uninterrupted write access or customerโcontrolled failover. This post explains: How Azure Key Vault replication actually works (per Microsoft Learn) Why pairedโregion failover does not equal business continuity Two reference architectures that implement true multiโregion Key Vault availability, with Terraform How Azure Key Vault Replication Works (Per Microsoft Learn) Azure Key Vault includes multiple layers of Microsoftโmanaged redundancy. InโRegion and Zone Resiliency Vault contents are replicated within the region. In regions that support availability zones, Key Vault is zoneโresilient by default. This protects against localized hardware or zone failures. PairedโRegion Replication If a Key Vault is deployed in a region with an Azureโdefined paired region, its contents are asynchronously replicated to that paired region. This replication is automatic and cannot be configured, observed, or tested by customers. MicrosoftโManaged Regional Failover If Microsoft declares a full regional outage, requests are automatically routed to the paired region. After failover, the vault operates in readโonly mode: โ Read secrets, keys, and certificates โ Perform cryptographic operations โ Create, update, rotate, or delete secrets, keys, or certificates This is a critical distinction. Pairedโregion replication preserves access โ not operational continuity. Why PairedโRegion Replication Is Not Business Continuity From a reliability and DR perspective, several limitations matter: Failover is Microsoftโinitiated, not customerโcontrolled No write operations during regional failover No secret rotation or certificate renewal No way to test DR Accidental deletions replicate No pointโinโtime recovery without backups Microsoft Learn explicitly states that critical workloads may require custom multiโregion strategies beyond builtโin replication. For many customers, this means Azure Key Vault becomes a singleโregion dependency in an otherwise multiโregion application design. The MultiโRegion Key Vault Pattern The two GitHub repositories below implement a common architectural shift: Multiple independent Key Vaults deployed in separate regions, with customerโcontrolled replication and failover. Instead of relying on invisible platform replication, the vaults become firstโclass, regionโscoped resources, aligned with application failover. Solution 1: Private, LockedโDown MultiโRegion Key Vault Replication Repository: ๐ https://github.com/jclem2000/KeyVault-MultiRegion-Replication-Private Architecture Highlights Independent Key Vault per region Private Endpoints only No public network exposure Terraformโbased deployment Controlled replication using Event Based synchronization What This Enables โ Full read/write access during regional outages โ Continued secret rotation and certificate renewal โ Customerโdefined failover and RTO โ DR testing and validation โ Strong alignment with zeroโtrust and regulated environments Tradeโoffs Higher operational complexity Requires automation and application awareness of multiple vaults Solution 2: LowโCost Public MultiโRegion Key Vault Replication Repository: ๐ https://github.com/jclem2000/KeyVault-MultiRegion-Replication-Public Architecture Highlights Independent Key Vault per region Public endpoints Minimal networking dependencies Terraformโbased Controlled replication using Event Based synchronization Optimized for simplicity and cost What This Enables โ Full read/write availability in any region โ Clear and testable DR posture โ Lower cost than private endpoint designs โ Suitable for many nonโregulated workloads Tradeโoffs Public exposure (mitigated via firewall rules, RBAC, and conditional access) Not appropriate for all compliance requirements Requires automation and application awareness of multiple vaults Azure Native Replication vs CustomerโManaged MultiโRegion Vaults Capability Azure Paired Region MultiโRegion Vaults Read access during outage โ โ Write access during outage โ โ Secret rotation during outage โ โ Customerโcontrolled failover โ โ DR testing โ โ Isolation from accidental deletion โ โ Predictable RTO โ โ Azure Key Vaultโs native replication optimizes for platform durability. The multiโregion pattern optimizes for application continuity. When to Use Each Approach PairedโRegion Replication Is Often Enough When: Secrets are mostly static Readโonly access during outages is acceptable RTO is flexible You prefer Microsoftโmanaged recovery MultiโRegion Vaults Are Recommended When: Secrets or certificates rotate frequently Applications must remain writable during outages Deterministic failover is required DR testing is mandatory Regulatory or operational isolation is needed Closing Thoughts Azure Key Vault behaves exactly as documented on Microsoft Learnโbut itโs important to be clear about what those guarantees mean. Pairedโregion replication protects your data, not your ability to operate. If your application is designed to survive a regional outage, Key Vault must follow the same multiโregion design principles as the application itself. The reference architectures above show how to extend Azureโs native durability model into true operational resilience, without waiting for a platformโlevel failover decision.131Views0likes0CommentsRecovering our Default Azure Directory
Hello, everyone, relative newcomer to Azure here. I'm dealing with an inherited situation and, to add to the fun, I've just discovered my organization only has a Basic support plan, so no access to Azure technical support. I'm hoping some knowledgeable souls on here are in a charitable mood and will point me in the right direction. We're having problems getting to our DNS subscription because it's locked away behind an Azure directory to which we don't seem to have access, and I'm not quite sure this is completely an Azure problem. I was able to get into this directory around a year ago but I so seldomly access it that I'm not sure when this changed. We have two Azure directories. One is our "regular" directory, named for our organization, and it's linked (not sure of the terminology here) to our domain. Let's call it This.Domain.com. There are no subscriptions in this directory. The other is named "Default Directory" and it's linked to an onmicrosoft domain -- let's call it OldAdminThisDomain.onmicrosoft.com. When I try to switch to this directory I'm prompted to log in, then I'm hit with the MFA prompt. This is normally not a problem but it's like the MFA was set up for a different account with the same email address. By contrast, I can log into both the regular Azure directory and the 365 admin page with no problem -- I type in my email address (let's call it email address removed for privacy reasons), MFA comes up, and I have several authentication methods to choose from: Microsoft's MFA app, SMS, email, YubiKey, phone, etc., and all these options work. When trying to log into the Azure Default Directory, however, the MFA acknowledges only either the Microsoft Authenticator app or Use a Verification Code (which also goes through the Microsoft Authenticator app), and neither option yields any prompt on my phone. I seem to recall I effectively had two different "accounts" that somehow used the same email address but had different MFA setups, but again this was around a year and 3 phones ago so I don't have a solid memory of what was happening. I am also aware that, while this should not be permittable, there have been several cases where multiple Microsoft accounts were somehow created using the same email address. So this is where I am. Ideally we could merge the two Azure directories so that we combine the accessibility of the "regular" directory with the subscription(s?) that are in the Default Directory. Barring that I would have to somehow get the (suspected) two Microsoft accounts based on the email address removed for privacy reasons email address corrected. Any help would be greatly appreciated. Thanks to all in advance52Views0likes1CommentBuilding Multi-Agent Orchestration Using Microsoft Semantic Kernel: A Complete Step-by-Step Guide
What You Will Build By the end of this guide, you will have a working multi-agent system where 4 specialist AI agents collaborate to diagnose production issues: ClientAnalyst โ Analyzes browser, JavaScript, CORS, uploads, and UI symptoms NetworkAnalyst โ Analyzes DNS, TCP/IP, TLS, load balancers, and firewalls ServerAnalyst โ Analyzes backend logs, database, deployments, and resource limits Coordinator โ Synthesizes all findings into a root cause report with a prioritized action plan These agents don't just run in sequence โ they debate, cross-examine, and challenge each other's findings through a shared conversation, producing a diagnosis that's better than any single agent could achieve alone. Table of Contents Why Multi-Agent? The Problem with Single Agents Architecture Overview Understanding the Key SK Components The Actor Model โ How InProcessRuntime Works Setting Up Your Development Environment Step-by-Step: Building the Multi-Agent Analyzer The Agent Interaction Flow โ Round by Round Bugs I Found & Fixed โ Lessons Learned Running with Different AI Providers What to Build Next 1. Why Multi-Agent? The Problem with Single Agents A single AI agent analyzing a production issue is like having one doctor diagnose everything โ they'll catch issues in their specialty but miss cross-domain connections. Consider this problem: "Users report 504 Gateway Timeout errors when uploading files larger than 10MB. Started after Friday's deployment. Worse during peak hours." A single agent might say "it's a server timeout" and stop. But the real root cause often spans multiple layers: The client is sending chunked uploads with an incorrect Content-Length header (client-side bug) The load balancer has a 30-second timeout that's too short for large uploads (network config) The server recently deployed a new request body parser that's 3x slower (server-side regression) The combination only fails during peak hours because connection pool saturation amplifies the latency No single perspective catches this. You need specialists who analyze independently, then debate to find the cross-layer causal chain. That's what multi-agent orchestration gives you. The 5 Orchestration Patterns in SK Semantic Kernel provides 5 built-in patterns for agent collaboration: SEQUENTIAL: A โ B โ C โ Done (pipeline โ each builds on previous) CONCURRENT: โ A โ Task โ B โ Aggregate โ C โ (parallel โ results merged) GROUP CHAT: A โ B โ C โ D โ We use this one (rounds, shared history, debate) HANDOFF: A โ (stuck?) โ B โ (complex?) โ Human (escalation with human-in-the-loop) MAGENTIC: LLM picks who speaks next dynamically (AI-driven speaker selection) We use GroupChatOrchestration with RoundRobinGroupChatManager because our problem requires agents to see each other's work, challenge assumptions, and build on each other's analysis across two rounds. 2. Architecture Overview Here's the complete architecture of what we're building: 3. Understanding the Key SK Components Before we write code, let's understand the 5 components we'll use and the design pattern each implements: ChatCompletionAgent โ Strategy Pattern The agent definition. Each agent is a combination of: name โ unique identifier (used in round-robin ordering) instructions โ the persona and rules (this is the prompt engineering) service โ which AI provider to call (Strategy Pattern โ swap providers without changing agent logic) description โ what other agents/tools understand about this agent agent = ChatCompletionAgent( name="ClientAnalyst", instructions="You are ONLY ClientAnalyst...", service=gemini_service, # โ Strategy: swap to OpenAI with zero changes description="Analyzes client-side issues", ) GroupChatOrchestration โ Mediator Pattern The orchestration defines HOW agents interact. It's the Mediator โ agents don't talk to each other directly. Instead, the orchestration manages a shared ChatHistory and routes messages through the Manager. RoundRobinGroupChatManager โ Strategy Pattern The Manager decides WHO speaks next. RoundRobinGroupChatManager cycles through agents in a fixed order. SK also provides AutomaticGroupChatManager where the LLM decides who speaks next. max_rounds is the total number of messages per agent or cycle. With 4 agents and max_rounds=8, each agent speaks exactly twice. InProcessRuntime โ Actor Model Abstraction The execution engine. Every agent becomes an "actor" with its own kind of mailbox (message queue). The runtime delivers messages between actors. Key properties: No shared state โ agents communicate only through messages Sequential processing โ each agent processes one message at a time Location transparency โ same code works in-process today, distributed tomorrow agent_response_callback โ Observer Pattern A function that fires after EVERY agent response. We use it to display each agent's output in real-time with emoji labels and round numbers. 4. The Actor Model โ How InProcessRuntime Works The Actor Model is a concurrency pattern where each entity is an isolated "actor" with a private mailbox. Here's what happens inside InProcessRuntime when we run our demo: runtime.start() โ โโโ Creates internal message loop (asyncio event loop) โ orchestration.invoke(task="504 timeout...", runtime=runtime) โ โโโ Creates Actor[Orchestrator] โ manages overall flow โโโ Creates Actor[Manager] โ RoundRobinGroupChatManager โโโ Creates Actor[ClientAnalyst] โ mailbox created, waiting โโโ Creates Actor[NetworkAnalyst] โ mailbox created, waiting โโโ Creates Actor[ServerAnalyst] โ mailbox created, waiting โโโ Creates Actor[Coordinator] โ mailbox created, waiting Manager receives "start" message โ โโโ Checks turn order: [Client, Network, Server, Coordinator] โโโ Sends task to ClientAnalyst mailbox โ โ ClientAnalyst processes: calls LLM โ response โ โ Response added to shared ChatHistory โ โ callback fires (displayed in Notebook UI) โ โ Sends "done" back to Manager โ โโโ Manager updates: turn_index=1 โโโ Sends to NetworkAnalyst mailbox โ โ Same flow... โ โโโ ... (ServerAnalyst, Coordinator for Round 1) โ โโโ Manager checks: messages=4, max_rounds=8 โ continue โ โโโ Round 2: same cycle with cross-examination โ โโโ After message 8: Manager sends "complete" โ OrchestrationResult resolves โ result.get() returns final answer runtime.stop_when_idle() โ All mailboxes empty โ clean shutdown The Actor Model guarantees: No race conditions (each actor processes one message at a time) No deadlocks (no shared locks to contend for) No shared mutable state (agents communicate only via messages) 5. Setting Up Your Development Environment Prerequisites Python 3.11 or 3.12 (3.13+ may have compatibility issues with some SK connectors) Visual Studio Code with the Python and Jupyter extensions An API key from one of: Google AI Studio (free), OpenAI Step 1: Install Python Download from python.org. During installation, check "Add Python to PATH". Verify: python --version # Python 3.12.x Step 2: Install VS Code Extensions Open VS Code, go to Extensions (Ctrl+Shift+X), and install: Python (by Microsoft) โ Python language support Jupyter (by Microsoft) โ Notebook support Pylance (by Microsoft) โ IntelliSense and type checking Step 3: Create Project Folder mkdir sk-multiagent-demo cd sk-multiagent-demo Open in VS Code: code . Step 4: Create Virtual Environment Open the VS Code terminal (Ctrl+`) and run: # Create virtual environment python -m venv sk-env # Activate it # Windows: sk-env\Scripts\activate # macOS/Linux: source sk-env/bin/activate You should see (sk-env) in your terminal prompt. Step 5: Install Semantic Kernel For Google Gemini (free tier โ recommended for getting started): pip install semantic-kernel[google] python-dotenv ipykernel For OpenAI (paid API key): pip install semantic-kernel openai python-dotenv ipykernel For Azure AI Foundry (enterprise, Entra ID auth): pip install semantic-kernel azure-identity python-dotenv ipykernel Step 6: Register the Jupyter Kernel python -m ipykernel install --user --name=sk-env --display-name="Semantic Kernel (Python 3.12)" You can also select if this is already available from your environment from VSCode as below: Step 7: Get Your API Key Option A โ Google Gemini (FREE, recommended for demo): Go to https://aistudio.google.com/apikey Click "Create API Key" Copy the key Free tier limits: 15 requests/minute, 1 million tokens/minute โ more than enough for this demo. Option B โ OpenAI: Go to https://platform.openai.com/api-keys Create a new key Copy the key Option C โ Azure AI Foundry: Deploy a model in Azure AI Foundry portal Note the endpoint URL and deployment name If key-based auth is disabled, you'll need Entra ID with permissions Step 8: Create the .env File In your project root, create a file named .env: For Gemini: GOOGLE_AI_API_KEY=AIzaSy...your-key-here GOOGLE_AI_GEMINI_MODEL_ID=gemini-2.5-flash For OpenAI: OPENAI_API_KEY=sk-...your-key-here OPENAI_CHAT_MODEL_ID=gpt-4o For Azure AI Foundry: AZURE_OPENAI_ENDPOINT=https://your-resource.cognitiveservices.azure.com AZURE_OPENAI_CHAT_DEPLOYMENT_NAME=gpt-4o AZURE_OPENAI_API_KEY=your-key Step 9: Create the Notebook In VS Code: Click File > New File Save as multi_agent_analyzer.ipynb In the top-right of the notebook, click Select Kernel Choose Semantic Kernel (Python 3.12) (or your sk-env) Your environment is ready. Let's build. 6. Step-by-Step: Building the Multi-Agent Analyzer Cell 1: Verify Setup import semantic_kernel print(f"Semantic Kernel version: {semantic_kernel.__version__}") from semantic_kernel.agents import ( ChatCompletionAgent, GroupChatOrchestration, RoundRobinGroupChatManager, ) from semantic_kernel.agents.runtime import InProcessRuntime from semantic_kernel.contents import ChatMessageContent print("All imports successful") Cell 2: Load API Key and Create Service For Gemini: import os from dotenv import load_dotenv load_dotenv() from semantic_kernel.connectors.ai.google.google_ai import ( GoogleAIChatCompletion, GoogleAIChatPromptExecutionSettings, ) from semantic_kernel.contents import ChatHistory GEMINI_API_KEY = os.getenv("GOOGLE_AI_API_KEY") GEMINI_MODEL = os.getenv("GOOGLE_AI_GEMINI_MODEL_ID", "gemini-2.5-flash") service = GoogleAIChatCompletion( gemini_model_id=GEMINI_MODEL, api_key=GEMINI_API_KEY, ) print(f"Service created: Gemini {GEMINI_MODEL}") # Smoke test settings = GoogleAIChatPromptExecutionSettings() test_history = ChatHistory(system_message="You are a helpful assistant.") test_history.add_user_message("Say 'Connected!' and nothing else.") response = await service.get_chat_message_content( chat_history=test_history, settings=settings ) print(f"Model says: {response.content}") For OpenAI: import os from dotenv import load_dotenv load_dotenv() from semantic_kernel.connectors.ai.open_ai import ( OpenAIChatCompletion, OpenAIChatPromptExecutionSettings, ) from semantic_kernel.contents import ChatHistory service = OpenAIChatCompletion( ai_model_id=os.getenv("OPENAI_CHAT_MODEL_ID", "gpt-4o"), ) print(f"Service created: OpenAI {os.getenv('OPENAI_CHAT_MODEL_ID', 'gpt-4o')}") # Smoke test settings = OpenAIChatPromptExecutionSettings() test_history = ChatHistory(system_message="You are a helpful assistant.") test_history.add_user_message("Say 'Connected!' and nothing else.") response = await service.get_chat_message_content( chat_history=test_history, settings=settings ) print(f"Model says: {response.content}") Cell 3: Define All 4 Agents This is the most important cell โ the prompt engineering that makes the demo work: from semantic_kernel.agents import ChatCompletionAgent # โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ # AGENT 1: Client-Side Analyst # โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ client_agent = ChatCompletionAgent( name="ClientAnalyst", description="Analyzes problems from the client-side: browser, JS, CORS, caching, UI symptoms", instructions="""You are ONLY **ClientAnalyst**. You must NEVER speak as NetworkAnalyst, ServerAnalyst, or Coordinator. Every word you write is from ClientAnalyst's perspective only. You are a senior front-end and client-side diagnostics expert. When given a problem statement, analyze it EXCLUSIVELY from the client side: 1. **Browser & Rendering**: DOM issues, JavaScript errors, CSS rendering, browser compatibility, memory leaks, console errors. 2. **Client-Side Caching**: Stale cache, service worker issues, local storage corruption. 3. **Network from Client View**: CORS errors, preflight failures, request timeouts, client-side retry storms, fetch/XHR configuration. 4. **Upload Handling**: File API usage, chunk upload implementation, progress tracking, FormData construction, content-type headers. 5. **UI/UX Symptoms**: What the user sees, error messages displayed, loading states. ROUND 1: Provide your independent analysis. Do NOT reference other agents. List your top 3 most likely causes with evidence. Every response MUST be at least 200 words. ROUND 2: You MUST: - Reference NetworkAnalyst and ServerAnalyst BY NAME - State specifically where you AGREE or DISAGREE with their findings - Answer the Coordinator's questions from your perspective - Add NEW cross-layer insights you see from the client perspective - Do NOT just say 'I agree' โ provide substantive technical reasoning Be specific, evidence-based, and prioritize findings by likelihood.""", service=service, ) # โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ # AGENT 2: Network Analyst # โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ network_agent = ChatCompletionAgent( name="NetworkAnalyst", description="Analyzes problems from the network side: DNS, TCP, TLS, firewalls, load balancers, latency", instructions="""You are ONLY **NetworkAnalyst**. You must NEVER speak as ClientAnalyst, ServerAnalyst, or Coordinator. Every word you write is from NetworkAnalyst's perspective only. You are a senior network infrastructure diagnostics expert. When given a problem statement, analyze it EXCLUSIVELY from the network layer: 1. **DNS & Resolution**: DNS TTL, propagation delays, record misconfigurations. 2. **TCP/IP & Connections**: Connection pooling, keep-alive, TCP window scaling, connection resets, SYN floods. 3. **TLS/SSL**: Certificate issues, handshake failures, protocol version mismatches. 4. **Load Balancers & Proxies**: Sticky sessions, health checks, timeout configs, request body size limits, proxy buffering. 5. **Firewall & WAF**: Rule blocks, rate limiting, request inspection delays, geo-blocking, DDoS protection interference. ROUND 1: Provide your independent analysis. Do NOT reference other agents. List your top 3 most likely causes with evidence. Every response MUST be at least 200 words. ROUND 2: You MUST: - Reference ClientAnalyst and ServerAnalyst BY NAME - State specifically where you AGREE or DISAGREE with their findings - Answer the Coordinator's questions from your perspective - Add NEW cross-layer insights you see from the network perspective - Do NOT just say 'I am ready to proceed' โ provide substantive technical analysis Be specific, evidence-based, and prioritize findings by likelihood.""", service=service, ) # โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ # AGENT 3: Server-Side Analyst # โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ server_agent = ChatCompletionAgent( name="ServerAnalyst", description="Analyzes problems from the server side: backend app, database, logs, resources, deployments", instructions="""You are ONLY **ServerAnalyst**. You must NEVER speak as ClientAnalyst, NetworkAnalyst, or Coordinator. Every word you write is from ServerAnalyst's perspective only. You are a senior backend and infrastructure diagnostics expert. When given a problem statement, analyze it EXCLUSIVELY from the server side: 1. **Application Server**: Error logs, exception traces, thread pool exhaustion, memory leaks, CPU spikes, garbage collection pauses. 2. **Database**: Slow queries, connection pool saturation, lock contention, deadlocks, replication lag, query plan changes. 3. **Deployment & Config**: Recent deployments, configuration changes, feature flags, environment variable mismatches, rollback candidates. 4. **Resource Limits**: File upload size limits, request body limits, disk space, temporary file cleanup, storage quotas. 5. **External Dependencies**: Upstream API timeouts, third-party service degradation, queue backlogs, cache (Redis/Memcached) issues. ROUND 1: Provide your independent analysis. Do NOT reference other agents. List your top 3 most likely causes with evidence. Every response MUST be at least 200 words. ROUND 2: You MUST: - Reference ClientAnalyst and NetworkAnalyst BY NAME - State specifically where you AGREE or DISAGREE with their findings - Answer the Coordinator's questions from your perspective - Add NEW cross-layer insights you see from the server perspective - Do NOT just say 'I agree' โ provide substantive technical reasoning Be specific, evidence-based, and prioritize findings by likelihood.""", service=service, ) # โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ # AGENT 4: Coordinator # โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ coordinator_agent = ChatCompletionAgent( name="Coordinator", description="Synthesizes all specialist analyses into a final root cause report with prioritized action plan", instructions="""You are ONLY **Coordinator**. You must NEVER speak as ClientAnalyst, NetworkAnalyst, or ServerAnalyst. You synthesize โ you do NOT do domain-specific analysis. You are the lead engineer who synthesizes the team's findings. โโโ ROUND 1 BEHAVIOR (your first turn, message 4) โโโ Keep this SHORT โ maximum 300 words. - Note 2-3 KEY PATTERNS across the three analyses - Identify where specialists AGREE (high-confidence) - Identify where they CONTRADICT (needs resolution) - Ask 2-3 SPECIFIC QUESTIONS for Round 2 Round 1 MUST NOT: assign tasks, create action plans, write reports, or tell agents what to take lead on. Observation + questions ONLY. โโโ ROUND 2 BEHAVIOR (your final turn, message 8) โโโ Keep this FOCUSED โ maximum 800 words. Produce a structured report: 1. **Root Cause** (1 paragraph): The #1 most likely cause with causal chain across layers. Reference specific findings from each specialist. 2. **Confidence** (short list): - HIGH: Areas where all 3 agreed - MEDIUM: Areas where 2 of 3 agreed - LOW: Disagreements needing investigation 3. **Action Plan** (numbered, max 6 items): For each: - What to do (specific) - Owner (Client/Network/Server team) - Time estimate 4. **Quick Wins vs Long-term** (2 short lists) Do NOT repeat what specialists already said verbatim. Synthesize, don't echo.""", service=service, ) # โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ # All 4 agents โ order = RoundRobin order # โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ agents = [client_agent, network_agent, server_agent, coordinator_agent] print(f"{len(agents)} agents created:") for i, a in enumerate(agents, 1): print(f" {i}. {a.name}: {a.description[:60]}...") print(f"\nRoundRobin order: {' โ '.join(a.name for a in agents)}") Cell 4: Run the Analysis from semantic_kernel.agents import GroupChatOrchestration, RoundRobinGroupChatManager from semantic_kernel.agents.runtime import InProcessRuntime from semantic_kernel.contents import ChatMessageContent from IPython.display import display, Markdown # โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ # โ EDIT YOUR PROBLEM STATEMENT HERE โ # โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ PROBLEM = """ Users are reporting intermittent 504 Gateway Timeout errors when trying to upload files larger than 10MB through our web application. The issue started after last Friday's deployment and seems worse during peak hours (2-5 PM EST). Some users also report that smaller file uploads work fine but the progress bar freezes at 85% for large files before timing out. """ # โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ agent_responses = [] def agent_response_callback(message: ChatMessageContent) -> None: name = message.name or "Unknown" content = message.content or "" agent_responses.append({"agent": name, "content": content}) emoji = { "ClientAnalyst": "๐ฅ๏ธ", "NetworkAnalyst": "๐", "ServerAnalyst": "โ๏ธ", "Coordinator": "๐ฏ" }.get(name, "๐น") round_num = (len(agent_responses) - 1) // len(agents) + 1 display(Markdown( f"---\n### {emoji} {name} (Message {len(agent_responses)}, Round {round_num})\n\n{content}" )) MAX_ROUNDS = 8 # 4 agents ร 2 rounds = 8 messages exactly task = f"""## Problem Statement {PROBLEM.strip()} ## Discussion Rules You are in a GROUP DISCUSSION with 4 members. You can see ALL previous messages. There are exactly 2 rounds. ### ROUND 1 (Messages 1-4): Independent Analysis - ClientAnalyst, NetworkAnalyst, ServerAnalyst: Analyze from YOUR domain only. Give your top 3 most likely causes with evidence and reasoning. - Coordinator: Note patterns across the 3 analyses. Ask 2-3 specific questions. Do NOT assign tasks yet. ### ROUND 2 (Messages 5-8): Cross-Examination & Final Report - ClientAnalyst, NetworkAnalyst, ServerAnalyst: You MUST reference the OTHER specialists BY NAME. State where you agree, disagree, or have new insights. Answer the Coordinator's questions. Provide SUBSTANTIVE analysis. - Coordinator: Produce the FINAL structured report: root cause, confidence levels, prioritized action plan with owners and time estimates. IMPORTANT: Each agent speaks as THEMSELVES only. Never impersonate another agent.""" display(Markdown(f"## Problem Statement\n\n{PROBLEM.strip()}")) display(Markdown(f"---\n## Discussion Starting โ {len(agents)} agents, {MAX_ROUNDS} rounds\n")) # Build and run orchestration = GroupChatOrchestration( members=agents, manager=RoundRobinGroupChatManager(max_rounds=MAX_ROUNDS), agent_response_callback=agent_response_callback, ) runtime = InProcessRuntime() runtime.start() result = await orchestration.invoke(task=task, runtime=runtime) final_result = await result.get(timeout=300) await runtime.stop_when_idle() display(Markdown(f"---\n## FINAL CONCLUSION\n\n{final_result}")) Cell 5: Statistics and Validation print("โ" * 55) print(" ANALYSIS STATISTICS") print("โ" * 55) emojis = {"ClientAnalyst": "๐ฅ๏ธ", "NetworkAnalyst": "๐", "ServerAnalyst": "โ๏ธ", "Coordinator": "๐ฏ"} agent_counts = {} agent_chars = {} for r in agent_responses: agent_counts[r["agent"]] = agent_counts.get(r["agent"], 0) + 1 agent_chars[r["agent"]] = agent_chars.get(r["agent"], 0) + len(r["content"]) for agent, count in agent_counts.items(): em = emojis.get(agent, "๐น") chars = agent_chars.get(agent, 0) avg = chars // count if count else 0 print(f" {em} {agent}: {count} msg(s), ~{chars:,} chars (avg {avg:,}/msg)") print(f"\n Total messages: {len(agent_responses)}") total_chars = sum(len(r['content']) for r in agent_responses) print(f" Total analysis: ~{total_chars:,} characters") # Validation print(f"\n Validation:") import re identity_issues = [] for r in agent_responses: other_agents = [a.name for a in agents if a.name != r["agent"]] for other in other_agents: pattern = rf'(?i)as {re.escape(other)}[,:]?\s+I\b' if re.search(pattern, r["content"][:300]): identity_issues.append(f"{r['agent']} impersonated {other}") if identity_issues: print(f" Identity confusion: {identity_issues}") else: print(f" No identity confusion detected") thin = [r for r in agent_responses if len(r["content"].strip()) < 100] if thin: for t in thin: print(f" Thin response from {t['agent']}") else: print(f" All responses are substantive") Cell 6: Save Report from datetime import datetime timestamp = datetime.now().strftime("%Y%m%d_%H%M%S") filename = f"analysis_report_{timestamp}.md" with open(filename, "w", encoding="utf-8") as f: f.write(f"# Problem Analysis Report\n\n") f.write(f"**Generated:** {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n") f.write(f"**Agents:** {', '.join(a.name for a in agents)}\n") f.write(f"**Rounds:** {MAX_ROUNDS}\n\n---\n\n") f.write(f"## Problem Statement\n\n{PROBLEM.strip()}\n\n---\n\n") for i, r in enumerate(agent_responses, 1): em = emojis.get(r['agent'], '๐น') round_num = (i - 1) // len(agents) + 1 f.write(f"### {em} {r['agent']} (Message {i}, Round {round_num})\n\n") f.write(f"{r['content']}\n\n---\n\n") f.write(f"## Final Conclusion\n\n{final_result}\n") print(f"Report saved to: {filename}") 7. The Agent Interaction Flow โ Round by Round Here's what actually happens during the 8-message orchestration: Round 1: Independent Analysis (Messages 1-4) Msg Agent What They See What They Do 1 ClientAnalyst Problem statement only Analyzes from client perspective: upload chunking, progress bar freezing at 85%, CORS, content-type headers 2 NetworkAnalyst Problem + ClientAnalyst's analysis Gives INDEPENDENT analysis despite seeing msg 1: load balancer timeouts, proxy body size limits, TCP window scaling 3 ServerAnalyst Problem + msgs 1-2 Gives INDEPENDENT analysis: recent deployment regression, request body parser, thread pool exhaustion, disk space 4 Coordinator Problem + msgs 1-3 Observes patterns: "All three mention timeout configuration. ClientAnalyst and NetworkAnalyst both point to body size. Question: Was the deployment a backend-only change or did it include infra?" Round 2: Cross-Examination (Messages 5-8) Msg Agent What They Do 5 ClientAnalyst "I agree with NetworkAnalyst that the load balancer timeout is likely a factor โ the 85% freeze point matches the 30-second LB timeout for a 10MB upload on our average upload speed. However, I disagree with ServerAnalyst about thread pool exhaustion because the UI shows a clean 504, not a connection reset." 6 NetworkAnalyst "ServerAnalyst's point about the recent deployment is critical โ if the new request parser is 3x slower, that would push uploads past the LB timeout. I can confirm the LB has a 30s idle timeout. The fix is both: increase LB timeout AND optimize the parser." 7 ServerAnalyst "Responding to Coordinator's question: The deployment was backend-only โ a new multipart parser using streaming instead of buffered reads. ClientAnalyst is correct that the 504 is from the LB, not the app. The app itself returns 200 after 45 seconds, but the LB kills the connection at 30." 8 Coordinator Produces final structured report with root cause: "The backend deployment introduced a slower multipart parser (45s vs 15s for 10MB). The load balancer's 30s timeout kills the connection at ~85% progress. Fix: immediate โ increase LB timeout to 120s. Short-term โ optimize parser. Long-term โ implement chunked uploads with progress resumption." Notice: The Round 2 analysis is dramatically better than Round 1. Agents reference each other by name, build on each other's findings, and the Coordinator can synthesize a cross-layer causal chain that no single agent could have produced. I made a small adjustment to the issue with Azure Web Apps. Please find the details below from testing carried out using Google Gemini: 8. Bugs I Found & Fixed โ Lessons Learned Building this demo taught me several important lessons about multi-agent systems: Bug 1: Agents Speaking Only Once Symptom: Only 4 messages instead of 8. Root cause: The agents list was missing the Coordinator. It was defined in a separate cell and wasn't included in the members list. Fix: All 4 agents must be in the same list passed to GroupChatOrchestration. Bug 2: NetworkAnalyst Says "I'm Ready to Proceed" Symptom: NetworkAnalyst's Round 2 response was just "I'm ready to proceed with the analysis" โ no actual content. Root cause: The Coordinator's Round 1 message was assigning tasks ("NetworkAnalyst, please check the load balancer config"), and the agent was acknowledging the assignment instead of analyzing. Fix: Added explicit constraint to Coordinator: "Round 1 MUST NOT assign tasks โ observation + questions ONLY." Bug 3: ServerAnalyst Says "As NetworkAnalyst, I..." Symptom: ServerAnalyst's response started with "As NetworkAnalyst, I believe..." Root cause: LLM identity bleeding. When agents share ChatHistory, the LLM sometimes loses track of which agent it's currently playing. This is especially common with Gemini. Fix: Identity anchoring at the very top of every agent's instructions: "You are ONLY ServerAnalyst. You must NEVER speak as ClientAnalyst, NetworkAnalyst, or Coordinator." Bug 4: Gemini Gives Thin/Empty Responses Symptom: Some agents responded with just one sentence or "I concur." Root cause: Gemini 2.5 Flash is more concise than GPT-4o by default. Without explicit length requirements, it takes shortcuts. Fix: Added "Every response MUST be at least 200 words" and "Answer the Coordinator's questions" to every specialist's instructions. Bug 5: Coordinator's Report is 18K Characters Symptom: The Coordinator's Round 2 response was absurdly long โ repeating everything every specialist said. Fix: Added word limits: "Round 1 max 300 words, Round 2 max 800 words" and "Synthesize, don't echo." Bug 6: MAX_ROUNDS Math Symptom: With MAX_ROUNDS=9, ClientAnalyst spoke a 3rd time after the Coordinator's final report โ breaking the clean 2-round structure. Fix: MAX_ROUNDS must equal (number of agents ร number of rounds). For 4 agents ร 2 rounds = 8. 9. Running with Different AI Providers The beauty of SK's Strategy Pattern is that you change ONE LINE to switch providers. Everything else โ agents, orchestration, callbacks, validation โ stays identical. Gemini setup: from semantic_kernel.connectors.ai.google.google_ai import GoogleAIChatCompletion service = GoogleAIChatCompletion( gemini_model_id="gemini-2.5-flash", api_key=os.getenv("GOOGLE_AI_API_KEY"), ) OpenAI Setup from semantic_kernel.connectors.ai.open_ai import OpenAIChatCompletion service = OpenAIChatCompletion( ai_model_id="gpt-4o", api_key=os.getenv("OPEN_AI_API_KEY"), ) 10. What to Build Next Add Plugins to Agents Give agents real tools โ not just LLM reasoning - looks exciting right ;) class NetworkDiagnosticPlugin: (description="Pings a host and returns latency") def ping(self, host: str) -> str: result = subprocess.run(["ping", "-c", "3", host], capture_output=True, text=True) return result.stdout class LogSearchPlugin: (description="Searches server logs for error patterns") def search_logs(self, pattern: str, hours: int = 1) -> str: # Query your log aggregator (Splunk, ELK, Azure Monitor) return query_logs(pattern, hours) Add Filters for Governance Intercept every agent call for PII redaction and audit logging: .filter(filter_type=FilterTypes.FUNCTION_INVOCATION) async def audit_filter(context, next): print(f"[AUDIT] {context.function.name} called by agent") await next(context) print(f"[AUDIT] {context.function.name} returned") Try Different Orchestration Patterns Replace GroupChat with Sequential for a pipeline approach: # Instead of debate, each agent builds on the previous orchestration = SequentialOrchestration( members=[client_agent, network_agent, server_agent, coordinator_agent] ) Or Concurrent for parallel analysis: # All specialists analyze simultaneously, Coordinator aggregates orchestration = ConcurrentOrchestration( members=[client_agent, network_agent, server_agent] ) Deploy to Azure Move from InProcessRuntime to Azure Container Apps for production scaling. The agent code doesn't change โ only the runtime. Summary The key insight from building this demo: multi-agent systems produce better results than single agents not because each agent is smarter, but because the debate structure forces cross-domain thinking that a single prompt can never achieve. The Coordinator's final report consistently identifies causal chains that span client, network, and server layers โ exactly the kind of insight that production incident response teams need. Semantic Kernel makes this possible with clean separation of concerns: agents define WHAT to analyze, orchestration defines HOW they interact, the manager defines WHO speaks when, the runtime handles WHERE it executes, and callbacks let you OBSERVE everything. Each piece is independently swappable โ that's the power of SK from Microsoft. Resources: GitHub: github.com/microsoft/semantic-kernel Docs: learn.microsoft.com/semantic-kernel Orchestration Patterns: learn.microsoft.com/semantic-kernel/frameworks/agent/agent-orchestration Discord: aka.ms/sk/discord Disclaimer: The sample scripts provided in this article are provided AS IS without warranty of any kind. The author is not responsible for any issues, damages, or problems that may arise from using these scripts. Users should thoroughly test any implementation in their environment before deploying to production. Azure services and APIs may change over time, which could affect the functionality of the provided scripts. Always refer to the latest Azure documentation for the most up-to-date information. Thanks for reading this blog! I hope you found it helpful and informative for building AI agents with SK (Semantic Kernel) ๐297Views3likes0CommentsMicrosoft App Access Panel requires MFA but we didn't enable it
Hi. Recently we've received a report from a user that he was asked to perform MFA when he was signing in. After checking the sign-in logs, we've found that it was an application called "Microsoft App Access Panel" and the status of that sign-in attempt was "interrupted". The detail of the log tells us that the authentication policies applied was "App requires MFA", but we couldn't find that policy anywhere in Conditional Access. The only MFA-related policy in Conditional Access was a policy that will requires user to perform MFA only including "Office 365 Exchange Online" but since the policy is not related to "Microsoft App Access Panel"(?) and the said user was excluded from that policy, I don't think that's the issue. We have already set the "Enable Security defaults" to "No" and we've checked that the "Multi-factor Auth Status" for the user was "Disabled". Does anyone knows where in Azure could be possibly causing MFA? Thanks.48KViews5likes19Comments