azure resource management
140 TopicsExcited to share my latest open-source project: KubeCost Guardian
After seeing how many DevOps teams struggle with Kubernetes cost visibility on Azure, I built a full-stack cost optimization platform from scratch. ๐ช๐ต๐ฎ๐ ๐ถ๐ ๐ฑ๐ผ๐ฒ๐: โ Real-time AKS cluster monitoring via Azure SDK โ Cost breakdown per namespace, node, and pod โ AI-powered recommendations generated from actual cluster state โ One-click optimization actions โ JWT-secured dashboard with full REST API ๐ง๐ฒ๐ฐ๐ต ๐ฆ๐๐ฎ๐ฐ๐ธ: - React 18 + TypeScript + Vite - Tailwind CSS + shadcn/ui + Recharts - Node.js + Express + TypeScript - Azure SDK (@azure/arm-containerservice) - JWT Authentication + Azure Service Principal ๐ช๐ต๐ฎ๐ ๐บ๐ฎ๐ธ๐ฒ๐ ๐ถ๐ ๐ฑ๐ถ๐ณ๐ณ๐ฒ๐ฟ๐ฒ๐ป๐: Most cost tools show you generic estimates. KubeCost Guardian reads your actual VM size, node count, and cluster configuration to generate recommendations that are specific to your infrastructure not averages. For example, if your cluster has only 2 nodes with no autoscaler enabled, it immediately flags the HA risk and calculates exactly how much you'd save by switching to Spot instances based on your actual VM size. This project is fully open-source and built for the DevOps community. โญ GitHub: https://github.com/HlaliMedAmine/kubecost-guardian This project represents hours of hard work, and passion. I decided to make it open-source so everyone can benefit from it ๐ค ,If you find it useful, Iโd really appreciate your support . Your support motivates me to keep building and sharing more powerful projects ๐. More exciting ideas are coming soonโฆ stay tuned! ๐ฅ.32Views0likes0CommentsPipeline Intelligence is live and open-source real-time Azure DevOps monitoring powered by AI .
Every DevOps team I've worked with had the same problem: Slow pipelines. Zero visibility. No idea where to start. So I stopped complaining and built the solution. So I built something about it. โก Pipeline Intelligence is a full-stack Azure DevOps monitoring dashboard that: โ Connects to your real Azure DevOps organization via REST API โ Detects bottlenecks across all your pipelines automatically โ Calculates exactly how much time your team is wasting per month โ Uses Gemini AI to generate prioritized fixes with ready-to-paste YAML solutions โ JWT-secured, Docker-ready, and fully open-source Tech Stack: โ React 18 + Vite + Tailwind CSS โ Node.js + Express + Azure DevOps API v7 โ Google Gemini 1.5 Flash โ JWT Authentication + Docker ๐ช๐ต๐ฎ๐ ๐บ๐ฎ๐ธ๐ฒ๐ ๐ถ๐ ๐ฑ๐ถ๐ณ๐ณ๐ฒ๐ฟ๐ฒ๐ป๐? Most tools show you generic estimates. Pipeline Intelligence reads your actual cluster config, node count, and pipeline structure and gives you recommendations specific to your infrastructure. ๐ฏ This year, I set myself a personal challenge: Build and open-source a series of production-grade tools exclusively focused on Azure services tools that solve real problems for real DevOps teams. This project represents weeks of research, architecture decisions, and late-night debugging sessions. I'm sharing it with the community because I believe great tooling should be accessible to everyone not locked behind enterprise paywalls. If this resonates with you, I have one simple ask: ๐ A like, a comment, or a share takes 3 seconds but it helps this reach the DevOps engineers who need it most. Your support is what keeps me building. โค๏ธ GitHub: https://github.com/HlaliMedAmine/pipeline-intelligence31Views0likes0CommentsAzure support team not responding to support request
I am posting here because I have not received a response to my support request despite my plan stating that I should hear back within 8 hours. It has now gone a day beyond that limit, and I am still waiting for assistance with this urgent matter. This issue is critical for my operations, and the delay is unacceptable. The ticket/reference number for my original support request was 2410100040000309. And I have created a brand new service request with ID 2412160040010160. I need this addressed immediately.813Views1like8CommentsBuilding Multi-Agent Orchestration Using Microsoft Semantic Kernel: A Complete Step-by-Step Guide
What You Will Build By the end of this guide, you will have a working multi-agent system where 4 specialist AI agents collaborate to diagnose production issues: ClientAnalyst โ Analyzes browser, JavaScript, CORS, uploads, and UI symptoms NetworkAnalyst โ Analyzes DNS, TCP/IP, TLS, load balancers, and firewalls ServerAnalyst โ Analyzes backend logs, database, deployments, and resource limits Coordinator โ Synthesizes all findings into a root cause report with a prioritized action plan These agents don't just run in sequence โ they debate, cross-examine, and challenge each other's findings through a shared conversation, producing a diagnosis that's better than any single agent could achieve alone. Table of Contents Why Multi-Agent? The Problem with Single Agents Architecture Overview Understanding the Key SK Components The Actor Model โ How InProcessRuntime Works Setting Up Your Development Environment Step-by-Step: Building the Multi-Agent Analyzer The Agent Interaction Flow โ Round by Round Bugs I Found & Fixed โ Lessons Learned Running with Different AI Providers What to Build Next 1. Why Multi-Agent? The Problem with Single Agents A single AI agent analyzing a production issue is like having one doctor diagnose everything โ they'll catch issues in their specialty but miss cross-domain connections. Consider this problem: "Users report 504 Gateway Timeout errors when uploading files larger than 10MB. Started after Friday's deployment. Worse during peak hours." A single agent might say "it's a server timeout" and stop. But the real root cause often spans multiple layers: The client is sending chunked uploads with an incorrect Content-Length header (client-side bug) The load balancer has a 30-second timeout that's too short for large uploads (network config) The server recently deployed a new request body parser that's 3x slower (server-side regression) The combination only fails during peak hours because connection pool saturation amplifies the latency No single perspective catches this. You need specialists who analyze independently, then debate to find the cross-layer causal chain. That's what multi-agent orchestration gives you. The 5 Orchestration Patterns in SK Semantic Kernel provides 5 built-in patterns for agent collaboration: SEQUENTIAL: A โ B โ C โ Done (pipeline โ each builds on previous) CONCURRENT: โ A โ Task โ B โ Aggregate โ C โ (parallel โ results merged) GROUP CHAT: A โ B โ C โ D โ We use this one (rounds, shared history, debate) HANDOFF: A โ (stuck?) โ B โ (complex?) โ Human (escalation with human-in-the-loop) MAGENTIC: LLM picks who speaks next dynamically (AI-driven speaker selection) We use GroupChatOrchestration with RoundRobinGroupChatManager because our problem requires agents to see each other's work, challenge assumptions, and build on each other's analysis across two rounds. 2. Architecture Overview Here's the complete architecture of what we're building: 3. Understanding the Key SK Components Before we write code, let's understand the 5 components we'll use and the design pattern each implements: ChatCompletionAgent โ Strategy Pattern The agent definition. Each agent is a combination of: name โ unique identifier (used in round-robin ordering) instructions โ the persona and rules (this is the prompt engineering) service โ which AI provider to call (Strategy Pattern โ swap providers without changing agent logic) description โ what other agents/tools understand about this agent agent = ChatCompletionAgent( name="ClientAnalyst", instructions="You are ONLY ClientAnalyst...", service=gemini_service, # โ Strategy: swap to OpenAI with zero changes description="Analyzes client-side issues", ) GroupChatOrchestration โ Mediator Pattern The orchestration defines HOW agents interact. It's the Mediator โ agents don't talk to each other directly. Instead, the orchestration manages a shared ChatHistory and routes messages through the Manager. RoundRobinGroupChatManager โ Strategy Pattern The Manager decides WHO speaks next. RoundRobinGroupChatManager cycles through agents in a fixed order. SK also provides AutomaticGroupChatManager where the LLM decides who speaks next. max_rounds is the total number of messages per agent or cycle. With 4 agents and max_rounds=8, each agent speaks exactly twice. InProcessRuntime โ Actor Model Abstraction The execution engine. Every agent becomes an "actor" with its own kind of mailbox (message queue). The runtime delivers messages between actors. Key properties: No shared state โ agents communicate only through messages Sequential processing โ each agent processes one message at a time Location transparency โ same code works in-process today, distributed tomorrow agent_response_callback โ Observer Pattern A function that fires after EVERY agent response. We use it to display each agent's output in real-time with emoji labels and round numbers. 4. The Actor Model โ How InProcessRuntime Works The Actor Model is a concurrency pattern where each entity is an isolated "actor" with a private mailbox. Here's what happens inside InProcessRuntime when we run our demo: runtime.start() โ โโโ Creates internal message loop (asyncio event loop) โ orchestration.invoke(task="504 timeout...", runtime=runtime) โ โโโ Creates Actor[Orchestrator] โ manages overall flow โโโ Creates Actor[Manager] โ RoundRobinGroupChatManager โโโ Creates Actor[ClientAnalyst] โ mailbox created, waiting โโโ Creates Actor[NetworkAnalyst] โ mailbox created, waiting โโโ Creates Actor[ServerAnalyst] โ mailbox created, waiting โโโ Creates Actor[Coordinator] โ mailbox created, waiting Manager receives "start" message โ โโโ Checks turn order: [Client, Network, Server, Coordinator] โโโ Sends task to ClientAnalyst mailbox โ โ ClientAnalyst processes: calls LLM โ response โ โ Response added to shared ChatHistory โ โ callback fires (displayed in Notebook UI) โ โ Sends "done" back to Manager โ โโโ Manager updates: turn_index=1 โโโ Sends to NetworkAnalyst mailbox โ โ Same flow... โ โโโ ... (ServerAnalyst, Coordinator for Round 1) โ โโโ Manager checks: messages=4, max_rounds=8 โ continue โ โโโ Round 2: same cycle with cross-examination โ โโโ After message 8: Manager sends "complete" โ OrchestrationResult resolves โ result.get() returns final answer runtime.stop_when_idle() โ All mailboxes empty โ clean shutdown The Actor Model guarantees: No race conditions (each actor processes one message at a time) No deadlocks (no shared locks to contend for) No shared mutable state (agents communicate only via messages) 5. Setting Up Your Development Environment Prerequisites Python 3.11 or 3.12 (3.13+ may have compatibility issues with some SK connectors) Visual Studio Code with the Python and Jupyter extensions An API key from one of: Google AI Studio (free), OpenAI Step 1: Install Python Download from python.org. During installation, check "Add Python to PATH". Verify: python --version # Python 3.12.x Step 2: Install VS Code Extensions Open VS Code, go to Extensions (Ctrl+Shift+X), and install: Python (by Microsoft) โ Python language support Jupyter (by Microsoft) โ Notebook support Pylance (by Microsoft) โ IntelliSense and type checking Step 3: Create Project Folder mkdir sk-multiagent-demo cd sk-multiagent-demo Open in VS Code: code . Step 4: Create Virtual Environment Open the VS Code terminal (Ctrl+`) and run: # Create virtual environment python -m venv sk-env # Activate it # Windows: sk-env\Scripts\activate # macOS/Linux: source sk-env/bin/activate You should see (sk-env) in your terminal prompt. Step 5: Install Semantic Kernel For Google Gemini (free tier โ recommended for getting started): pip install semantic-kernel[google] python-dotenv ipykernel For OpenAI (paid API key): pip install semantic-kernel openai python-dotenv ipykernel For Azure AI Foundry (enterprise, Entra ID auth): pip install semantic-kernel azure-identity python-dotenv ipykernel Step 6: Register the Jupyter Kernel python -m ipykernel install --user --name=sk-env --display-name="Semantic Kernel (Python 3.12)" You can also select if this is already available from your environment from VSCode as below: Step 7: Get Your API Key Option A โ Google Gemini (FREE, recommended for demo): Go to https://aistudio.google.com/apikey Click "Create API Key" Copy the key Free tier limits: 15 requests/minute, 1 million tokens/minute โ more than enough for this demo. Option B โ OpenAI: Go to https://platform.openai.com/api-keys Create a new key Copy the key Option C โ Azure AI Foundry: Deploy a model in Azure AI Foundry portal Note the endpoint URL and deployment name If key-based auth is disabled, you'll need Entra ID with permissions Step 8: Create the .env File In your project root, create a file named .env: For Gemini: GOOGLE_AI_API_KEY=AIzaSy...your-key-here GOOGLE_AI_GEMINI_MODEL_ID=gemini-2.5-flash For OpenAI: OPENAI_API_KEY=sk-...your-key-here OPENAI_CHAT_MODEL_ID=gpt-4o For Azure AI Foundry: AZURE_OPENAI_ENDPOINT=https://your-resource.cognitiveservices.azure.com AZURE_OPENAI_CHAT_DEPLOYMENT_NAME=gpt-4o AZURE_OPENAI_API_KEY=your-key Step 9: Create the Notebook In VS Code: Click File > New File Save as multi_agent_analyzer.ipynb In the top-right of the notebook, click Select Kernel Choose Semantic Kernel (Python 3.12) (or your sk-env) Your environment is ready. Let's build. 6. Step-by-Step: Building the Multi-Agent Analyzer Cell 1: Verify Setup import semantic_kernel print(f"Semantic Kernel version: {semantic_kernel.__version__}") from semantic_kernel.agents import ( ChatCompletionAgent, GroupChatOrchestration, RoundRobinGroupChatManager, ) from semantic_kernel.agents.runtime import InProcessRuntime from semantic_kernel.contents import ChatMessageContent print("All imports successful") Cell 2: Load API Key and Create Service For Gemini: import os from dotenv import load_dotenv load_dotenv() from semantic_kernel.connectors.ai.google.google_ai import ( GoogleAIChatCompletion, GoogleAIChatPromptExecutionSettings, ) from semantic_kernel.contents import ChatHistory GEMINI_API_KEY = os.getenv("GOOGLE_AI_API_KEY") GEMINI_MODEL = os.getenv("GOOGLE_AI_GEMINI_MODEL_ID", "gemini-2.5-flash") service = GoogleAIChatCompletion( gemini_model_id=GEMINI_MODEL, api_key=GEMINI_API_KEY, ) print(f"Service created: Gemini {GEMINI_MODEL}") # Smoke test settings = GoogleAIChatPromptExecutionSettings() test_history = ChatHistory(system_message="You are a helpful assistant.") test_history.add_user_message("Say 'Connected!' and nothing else.") response = await service.get_chat_message_content( chat_history=test_history, settings=settings ) print(f"Model says: {response.content}") For OpenAI: import os from dotenv import load_dotenv load_dotenv() from semantic_kernel.connectors.ai.open_ai import ( OpenAIChatCompletion, OpenAIChatPromptExecutionSettings, ) from semantic_kernel.contents import ChatHistory service = OpenAIChatCompletion( ai_model_id=os.getenv("OPENAI_CHAT_MODEL_ID", "gpt-4o"), ) print(f"Service created: OpenAI {os.getenv('OPENAI_CHAT_MODEL_ID', 'gpt-4o')}") # Smoke test settings = OpenAIChatPromptExecutionSettings() test_history = ChatHistory(system_message="You are a helpful assistant.") test_history.add_user_message("Say 'Connected!' and nothing else.") response = await service.get_chat_message_content( chat_history=test_history, settings=settings ) print(f"Model says: {response.content}") Cell 3: Define All 4 Agents This is the most important cell โ the prompt engineering that makes the demo work: from semantic_kernel.agents import ChatCompletionAgent # โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ # AGENT 1: Client-Side Analyst # โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ client_agent = ChatCompletionAgent( name="ClientAnalyst", description="Analyzes problems from the client-side: browser, JS, CORS, caching, UI symptoms", instructions="""You are ONLY **ClientAnalyst**. You must NEVER speak as NetworkAnalyst, ServerAnalyst, or Coordinator. Every word you write is from ClientAnalyst's perspective only. You are a senior front-end and client-side diagnostics expert. When given a problem statement, analyze it EXCLUSIVELY from the client side: 1. **Browser & Rendering**: DOM issues, JavaScript errors, CSS rendering, browser compatibility, memory leaks, console errors. 2. **Client-Side Caching**: Stale cache, service worker issues, local storage corruption. 3. **Network from Client View**: CORS errors, preflight failures, request timeouts, client-side retry storms, fetch/XHR configuration. 4. **Upload Handling**: File API usage, chunk upload implementation, progress tracking, FormData construction, content-type headers. 5. **UI/UX Symptoms**: What the user sees, error messages displayed, loading states. ROUND 1: Provide your independent analysis. Do NOT reference other agents. List your top 3 most likely causes with evidence. Every response MUST be at least 200 words. ROUND 2: You MUST: - Reference NetworkAnalyst and ServerAnalyst BY NAME - State specifically where you AGREE or DISAGREE with their findings - Answer the Coordinator's questions from your perspective - Add NEW cross-layer insights you see from the client perspective - Do NOT just say 'I agree' โ provide substantive technical reasoning Be specific, evidence-based, and prioritize findings by likelihood.""", service=service, ) # โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ # AGENT 2: Network Analyst # โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ network_agent = ChatCompletionAgent( name="NetworkAnalyst", description="Analyzes problems from the network side: DNS, TCP, TLS, firewalls, load balancers, latency", instructions="""You are ONLY **NetworkAnalyst**. You must NEVER speak as ClientAnalyst, ServerAnalyst, or Coordinator. Every word you write is from NetworkAnalyst's perspective only. You are a senior network infrastructure diagnostics expert. When given a problem statement, analyze it EXCLUSIVELY from the network layer: 1. **DNS & Resolution**: DNS TTL, propagation delays, record misconfigurations. 2. **TCP/IP & Connections**: Connection pooling, keep-alive, TCP window scaling, connection resets, SYN floods. 3. **TLS/SSL**: Certificate issues, handshake failures, protocol version mismatches. 4. **Load Balancers & Proxies**: Sticky sessions, health checks, timeout configs, request body size limits, proxy buffering. 5. **Firewall & WAF**: Rule blocks, rate limiting, request inspection delays, geo-blocking, DDoS protection interference. ROUND 1: Provide your independent analysis. Do NOT reference other agents. List your top 3 most likely causes with evidence. Every response MUST be at least 200 words. ROUND 2: You MUST: - Reference ClientAnalyst and ServerAnalyst BY NAME - State specifically where you AGREE or DISAGREE with their findings - Answer the Coordinator's questions from your perspective - Add NEW cross-layer insights you see from the network perspective - Do NOT just say 'I am ready to proceed' โ provide substantive technical analysis Be specific, evidence-based, and prioritize findings by likelihood.""", service=service, ) # โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ # AGENT 3: Server-Side Analyst # โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ server_agent = ChatCompletionAgent( name="ServerAnalyst", description="Analyzes problems from the server side: backend app, database, logs, resources, deployments", instructions="""You are ONLY **ServerAnalyst**. You must NEVER speak as ClientAnalyst, NetworkAnalyst, or Coordinator. Every word you write is from ServerAnalyst's perspective only. You are a senior backend and infrastructure diagnostics expert. When given a problem statement, analyze it EXCLUSIVELY from the server side: 1. **Application Server**: Error logs, exception traces, thread pool exhaustion, memory leaks, CPU spikes, garbage collection pauses. 2. **Database**: Slow queries, connection pool saturation, lock contention, deadlocks, replication lag, query plan changes. 3. **Deployment & Config**: Recent deployments, configuration changes, feature flags, environment variable mismatches, rollback candidates. 4. **Resource Limits**: File upload size limits, request body limits, disk space, temporary file cleanup, storage quotas. 5. **External Dependencies**: Upstream API timeouts, third-party service degradation, queue backlogs, cache (Redis/Memcached) issues. ROUND 1: Provide your independent analysis. Do NOT reference other agents. List your top 3 most likely causes with evidence. Every response MUST be at least 200 words. ROUND 2: You MUST: - Reference ClientAnalyst and NetworkAnalyst BY NAME - State specifically where you AGREE or DISAGREE with their findings - Answer the Coordinator's questions from your perspective - Add NEW cross-layer insights you see from the server perspective - Do NOT just say 'I agree' โ provide substantive technical reasoning Be specific, evidence-based, and prioritize findings by likelihood.""", service=service, ) # โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ # AGENT 4: Coordinator # โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ coordinator_agent = ChatCompletionAgent( name="Coordinator", description="Synthesizes all specialist analyses into a final root cause report with prioritized action plan", instructions="""You are ONLY **Coordinator**. You must NEVER speak as ClientAnalyst, NetworkAnalyst, or ServerAnalyst. You synthesize โ you do NOT do domain-specific analysis. You are the lead engineer who synthesizes the team's findings. โโโ ROUND 1 BEHAVIOR (your first turn, message 4) โโโ Keep this SHORT โ maximum 300 words. - Note 2-3 KEY PATTERNS across the three analyses - Identify where specialists AGREE (high-confidence) - Identify where they CONTRADICT (needs resolution) - Ask 2-3 SPECIFIC QUESTIONS for Round 2 Round 1 MUST NOT: assign tasks, create action plans, write reports, or tell agents what to take lead on. Observation + questions ONLY. โโโ ROUND 2 BEHAVIOR (your final turn, message 8) โโโ Keep this FOCUSED โ maximum 800 words. Produce a structured report: 1. **Root Cause** (1 paragraph): The #1 most likely cause with causal chain across layers. Reference specific findings from each specialist. 2. **Confidence** (short list): - HIGH: Areas where all 3 agreed - MEDIUM: Areas where 2 of 3 agreed - LOW: Disagreements needing investigation 3. **Action Plan** (numbered, max 6 items): For each: - What to do (specific) - Owner (Client/Network/Server team) - Time estimate 4. **Quick Wins vs Long-term** (2 short lists) Do NOT repeat what specialists already said verbatim. Synthesize, don't echo.""", service=service, ) # โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ # All 4 agents โ order = RoundRobin order # โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ agents = [client_agent, network_agent, server_agent, coordinator_agent] print(f"{len(agents)} agents created:") for i, a in enumerate(agents, 1): print(f" {i}. {a.name}: {a.description[:60]}...") print(f"\nRoundRobin order: {' โ '.join(a.name for a in agents)}") Cell 4: Run the Analysis from semantic_kernel.agents import GroupChatOrchestration, RoundRobinGroupChatManager from semantic_kernel.agents.runtime import InProcessRuntime from semantic_kernel.contents import ChatMessageContent from IPython.display import display, Markdown # โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ # โ EDIT YOUR PROBLEM STATEMENT HERE โ # โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ PROBLEM = """ Users are reporting intermittent 504 Gateway Timeout errors when trying to upload files larger than 10MB through our web application. The issue started after last Friday's deployment and seems worse during peak hours (2-5 PM EST). Some users also report that smaller file uploads work fine but the progress bar freezes at 85% for large files before timing out. """ # โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ agent_responses = [] def agent_response_callback(message: ChatMessageContent) -> None: name = message.name or "Unknown" content = message.content or "" agent_responses.append({"agent": name, "content": content}) emoji = { "ClientAnalyst": "๐ฅ๏ธ", "NetworkAnalyst": "๐", "ServerAnalyst": "โ๏ธ", "Coordinator": "๐ฏ" }.get(name, "๐น") round_num = (len(agent_responses) - 1) // len(agents) + 1 display(Markdown( f"---\n### {emoji} {name} (Message {len(agent_responses)}, Round {round_num})\n\n{content}" )) MAX_ROUNDS = 8 # 4 agents ร 2 rounds = 8 messages exactly task = f"""## Problem Statement {PROBLEM.strip()} ## Discussion Rules You are in a GROUP DISCUSSION with 4 members. You can see ALL previous messages. There are exactly 2 rounds. ### ROUND 1 (Messages 1-4): Independent Analysis - ClientAnalyst, NetworkAnalyst, ServerAnalyst: Analyze from YOUR domain only. Give your top 3 most likely causes with evidence and reasoning. - Coordinator: Note patterns across the 3 analyses. Ask 2-3 specific questions. Do NOT assign tasks yet. ### ROUND 2 (Messages 5-8): Cross-Examination & Final Report - ClientAnalyst, NetworkAnalyst, ServerAnalyst: You MUST reference the OTHER specialists BY NAME. State where you agree, disagree, or have new insights. Answer the Coordinator's questions. Provide SUBSTANTIVE analysis. - Coordinator: Produce the FINAL structured report: root cause, confidence levels, prioritized action plan with owners and time estimates. IMPORTANT: Each agent speaks as THEMSELVES only. Never impersonate another agent.""" display(Markdown(f"## Problem Statement\n\n{PROBLEM.strip()}")) display(Markdown(f"---\n## Discussion Starting โ {len(agents)} agents, {MAX_ROUNDS} rounds\n")) # Build and run orchestration = GroupChatOrchestration( members=agents, manager=RoundRobinGroupChatManager(max_rounds=MAX_ROUNDS), agent_response_callback=agent_response_callback, ) runtime = InProcessRuntime() runtime.start() result = await orchestration.invoke(task=task, runtime=runtime) final_result = await result.get(timeout=300) await runtime.stop_when_idle() display(Markdown(f"---\n## FINAL CONCLUSION\n\n{final_result}")) Cell 5: Statistics and Validation print("โ" * 55) print(" ANALYSIS STATISTICS") print("โ" * 55) emojis = {"ClientAnalyst": "๐ฅ๏ธ", "NetworkAnalyst": "๐", "ServerAnalyst": "โ๏ธ", "Coordinator": "๐ฏ"} agent_counts = {} agent_chars = {} for r in agent_responses: agent_counts[r["agent"]] = agent_counts.get(r["agent"], 0) + 1 agent_chars[r["agent"]] = agent_chars.get(r["agent"], 0) + len(r["content"]) for agent, count in agent_counts.items(): em = emojis.get(agent, "๐น") chars = agent_chars.get(agent, 0) avg = chars // count if count else 0 print(f" {em} {agent}: {count} msg(s), ~{chars:,} chars (avg {avg:,}/msg)") print(f"\n Total messages: {len(agent_responses)}") total_chars = sum(len(r['content']) for r in agent_responses) print(f" Total analysis: ~{total_chars:,} characters") # Validation print(f"\n Validation:") import re identity_issues = [] for r in agent_responses: other_agents = [a.name for a in agents if a.name != r["agent"]] for other in other_agents: pattern = rf'(?i)as {re.escape(other)}[,:]?\s+I\b' if re.search(pattern, r["content"][:300]): identity_issues.append(f"{r['agent']} impersonated {other}") if identity_issues: print(f" Identity confusion: {identity_issues}") else: print(f" No identity confusion detected") thin = [r for r in agent_responses if len(r["content"].strip()) < 100] if thin: for t in thin: print(f" Thin response from {t['agent']}") else: print(f" All responses are substantive") Cell 6: Save Report from datetime import datetime timestamp = datetime.now().strftime("%Y%m%d_%H%M%S") filename = f"analysis_report_{timestamp}.md" with open(filename, "w", encoding="utf-8") as f: f.write(f"# Problem Analysis Report\n\n") f.write(f"**Generated:** {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n") f.write(f"**Agents:** {', '.join(a.name for a in agents)}\n") f.write(f"**Rounds:** {MAX_ROUNDS}\n\n---\n\n") f.write(f"## Problem Statement\n\n{PROBLEM.strip()}\n\n---\n\n") for i, r in enumerate(agent_responses, 1): em = emojis.get(r['agent'], '๐น') round_num = (i - 1) // len(agents) + 1 f.write(f"### {em} {r['agent']} (Message {i}, Round {round_num})\n\n") f.write(f"{r['content']}\n\n---\n\n") f.write(f"## Final Conclusion\n\n{final_result}\n") print(f"Report saved to: {filename}") 7. The Agent Interaction Flow โ Round by Round Here's what actually happens during the 8-message orchestration: Round 1: Independent Analysis (Messages 1-4) Msg Agent What They See What They Do 1 ClientAnalyst Problem statement only Analyzes from client perspective: upload chunking, progress bar freezing at 85%, CORS, content-type headers 2 NetworkAnalyst Problem + ClientAnalyst's analysis Gives INDEPENDENT analysis despite seeing msg 1: load balancer timeouts, proxy body size limits, TCP window scaling 3 ServerAnalyst Problem + msgs 1-2 Gives INDEPENDENT analysis: recent deployment regression, request body parser, thread pool exhaustion, disk space 4 Coordinator Problem + msgs 1-3 Observes patterns: "All three mention timeout configuration. ClientAnalyst and NetworkAnalyst both point to body size. Question: Was the deployment a backend-only change or did it include infra?" Round 2: Cross-Examination (Messages 5-8) Msg Agent What They Do 5 ClientAnalyst "I agree with NetworkAnalyst that the load balancer timeout is likely a factor โ the 85% freeze point matches the 30-second LB timeout for a 10MB upload on our average upload speed. However, I disagree with ServerAnalyst about thread pool exhaustion because the UI shows a clean 504, not a connection reset." 6 NetworkAnalyst "ServerAnalyst's point about the recent deployment is critical โ if the new request parser is 3x slower, that would push uploads past the LB timeout. I can confirm the LB has a 30s idle timeout. The fix is both: increase LB timeout AND optimize the parser." 7 ServerAnalyst "Responding to Coordinator's question: The deployment was backend-only โ a new multipart parser using streaming instead of buffered reads. ClientAnalyst is correct that the 504 is from the LB, not the app. The app itself returns 200 after 45 seconds, but the LB kills the connection at 30." 8 Coordinator Produces final structured report with root cause: "The backend deployment introduced a slower multipart parser (45s vs 15s for 10MB). The load balancer's 30s timeout kills the connection at ~85% progress. Fix: immediate โ increase LB timeout to 120s. Short-term โ optimize parser. Long-term โ implement chunked uploads with progress resumption." Notice: The Round 2 analysis is dramatically better than Round 1. Agents reference each other by name, build on each other's findings, and the Coordinator can synthesize a cross-layer causal chain that no single agent could have produced. I made a small adjustment to the issue with Azure Web Apps. Please find the details below from testing carried out using Google Gemini: 8. Bugs I Found & Fixed โ Lessons Learned Building this demo taught me several important lessons about multi-agent systems: Bug 1: Agents Speaking Only Once Symptom: Only 4 messages instead of 8. Root cause: The agents list was missing the Coordinator. It was defined in a separate cell and wasn't included in the members list. Fix: All 4 agents must be in the same list passed to GroupChatOrchestration. Bug 2: NetworkAnalyst Says "I'm Ready to Proceed" Symptom: NetworkAnalyst's Round 2 response was just "I'm ready to proceed with the analysis" โ no actual content. Root cause: The Coordinator's Round 1 message was assigning tasks ("NetworkAnalyst, please check the load balancer config"), and the agent was acknowledging the assignment instead of analyzing. Fix: Added explicit constraint to Coordinator: "Round 1 MUST NOT assign tasks โ observation + questions ONLY." Bug 3: ServerAnalyst Says "As NetworkAnalyst, I..." Symptom: ServerAnalyst's response started with "As NetworkAnalyst, I believe..." Root cause: LLM identity bleeding. When agents share ChatHistory, the LLM sometimes loses track of which agent it's currently playing. This is especially common with Gemini. Fix: Identity anchoring at the very top of every agent's instructions: "You are ONLY ServerAnalyst. You must NEVER speak as ClientAnalyst, NetworkAnalyst, or Coordinator." Bug 4: Gemini Gives Thin/Empty Responses Symptom: Some agents responded with just one sentence or "I concur." Root cause: Gemini 2.5 Flash is more concise than GPT-4o by default. Without explicit length requirements, it takes shortcuts. Fix: Added "Every response MUST be at least 200 words" and "Answer the Coordinator's questions" to every specialist's instructions. Bug 5: Coordinator's Report is 18K Characters Symptom: The Coordinator's Round 2 response was absurdly long โ repeating everything every specialist said. Fix: Added word limits: "Round 1 max 300 words, Round 2 max 800 words" and "Synthesize, don't echo." Bug 6: MAX_ROUNDS Math Symptom: With MAX_ROUNDS=9, ClientAnalyst spoke a 3rd time after the Coordinator's final report โ breaking the clean 2-round structure. Fix: MAX_ROUNDS must equal (number of agents ร number of rounds). For 4 agents ร 2 rounds = 8. 9. Running with Different AI Providers The beauty of SK's Strategy Pattern is that you change ONE LINE to switch providers. Everything else โ agents, orchestration, callbacks, validation โ stays identical. Gemini setup: from semantic_kernel.connectors.ai.google.google_ai import GoogleAIChatCompletion service = GoogleAIChatCompletion( gemini_model_id="gemini-2.5-flash", api_key=os.getenv("GOOGLE_AI_API_KEY"), ) OpenAI Setup from semantic_kernel.connectors.ai.open_ai import OpenAIChatCompletion service = OpenAIChatCompletion( ai_model_id="gpt-4o", api_key=os.getenv("OPEN_AI_API_KEY"), ) 10. What to Build Next Add Plugins to Agents Give agents real tools โ not just LLM reasoning - looks exciting right ;) class NetworkDiagnosticPlugin: (description="Pings a host and returns latency") def ping(self, host: str) -> str: result = subprocess.run(["ping", "-c", "3", host], capture_output=True, text=True) return result.stdout class LogSearchPlugin: (description="Searches server logs for error patterns") def search_logs(self, pattern: str, hours: int = 1) -> str: # Query your log aggregator (Splunk, ELK, Azure Monitor) return query_logs(pattern, hours) Add Filters for Governance Intercept every agent call for PII redaction and audit logging: .filter(filter_type=FilterTypes.FUNCTION_INVOCATION) async def audit_filter(context, next): print(f"[AUDIT] {context.function.name} called by agent") await next(context) print(f"[AUDIT] {context.function.name} returned") Try Different Orchestration Patterns Replace GroupChat with Sequential for a pipeline approach: # Instead of debate, each agent builds on the previous orchestration = SequentialOrchestration( members=[client_agent, network_agent, server_agent, coordinator_agent] ) Or Concurrent for parallel analysis: # All specialists analyze simultaneously, Coordinator aggregates orchestration = ConcurrentOrchestration( members=[client_agent, network_agent, server_agent] ) Deploy to Azure Move from InProcessRuntime to Azure Container Apps for production scaling. The agent code doesn't change โ only the runtime. Summary The key insight from building this demo: multi-agent systems produce better results than single agents not because each agent is smarter, but because the debate structure forces cross-domain thinking that a single prompt can never achieve. The Coordinator's final report consistently identifies causal chains that span client, network, and server layers โ exactly the kind of insight that production incident response teams need. Semantic Kernel makes this possible with clean separation of concerns: agents define WHAT to analyze, orchestration defines HOW they interact, the manager defines WHO speaks when, the runtime handles WHERE it executes, and callbacks let you OBSERVE everything. Each piece is independently swappable โ that's the power of SK from Microsoft. Resources: GitHub: github.com/microsoft/semantic-kernel Docs: learn.microsoft.com/semantic-kernel Orchestration Patterns: learn.microsoft.com/semantic-kernel/frameworks/agent/agent-orchestration Discord: aka.ms/sk/discord Disclaimer: The sample scripts provided in this article are provided AS IS without warranty of any kind. The author is not responsible for any issues, damages, or problems that may arise from using these scripts. Users should thoroughly test any implementation in their environment before deploying to production. Azure services and APIs may change over time, which could affect the functionality of the provided scripts. Always refer to the latest Azure documentation for the most up-to-date information. Thanks for reading this blog! I hope you found it helpful and informative for building AI agents with SK (Semantic Kernel) ๐320Views3likes0CommentsI passed the GHโ900: GitHub Foundations exam!
Hi everyone, Iโm excited to share that I cleared the GHโ900 (GitHub Foundations) exam with a good score! This certification validates my understanding of Git, repository collaboration, pull requests, and GitHubโs core features. Preparation Approach: I studied using Microsoft Learn resources and the GHโ900 study guide. For extra practice and exam-style questions, I used dumps-4-azure โ it really gave me the extra edge for exam readiness. I also practiced hands-on with real GitHub workflows (branches, pull requests, projects) to reinforce my understanding. Key Takeaways: The exam tests foundational Git + GitHub collaboration skills โ not just theory. Practical experience combined with mock questions made a big difference. Consistency in daily preparation is the key. Next Steps: After GHโ900, Iโm planning to go for GHโ100 (GitHub Administration) to deepen my GitHub skills at the organizational level.175Views1like1CommentPAAS resource metrics using Azure Data Collection Rule to Log Analytics Workspace
Hi Team, I want to build a use case to pull the Azure PAAS resources metrics using azure DCR and push that data metrics to log analytics workspace which eventually will push the data to azure event hub through streaming and final destination as azure postgres to store all the resources metrics information in a centralized table and create KPIs and dashboard for the clients for better utilization of resources. I have not used diagnose setting enabling option since it has its cons like we need to manually enable each resources settings also we get limited information extracted from diagnose setting. But while implementing i saw multiple articles stating DCR is not used for pulling PAAS metrics its only compatible for VM metrics. Want to understand is it possible to use DCR for PAAS metrics? Thanks in advance for any inputs.Solved148Views0likes2CommentsApplying DevOps Principles on Lean Infrastructure. Lessons From Scaling to 102K Users.
Hi Azure Community, I'm a Microsoft Certified DevOps Engineer, and I want to share an unusual journey. I have been applying DevOps principles on traditional VPS infrastructure to scale to 102,000 users with 99.2% uptime. Why am I posting this in an Azure community? Because I'm planning migration to Azure in 2026, and I want to understand: What mistakes am I already making that will bite me during migration? THE CURRENT SETUP Platform: Social commerce (West Africa) Users: 102,000 active Monthly events: 2 million Uptime: 99.2% Infrastructure: Single VPS Stack: PHP/Laravel, MySQL, Redis Yes - one VPS. No cloud. No Kubernetes. No microservices. WHY I HAVEN'T USED AZURE YET Honest answer: Budget constraints in emerging market startup ecosystem. At our current scale, fully managed Azure services would significantly increase monthly burn before product-market expansion. The funding we raised needs to last through growth milestones. The trade: I manually optimize what Azure would auto-scale. I debug what Application Insights would catch. I do by hand what Azure Functions would automate. DEVOPS PRACTICES THAT KEPT US RUNNING Even on single-server infrastructure, core DevOps principles still apply: CI/CD Pipeline (GitHub Actions) โข 3-5 deployments weekly โข Zero-downtime deploys โข Automated rollback on health check failures โข Feature flags for gradual rollouts Monitoring & Observability โข Custom monitoring (would love Application Insights) โข Real-time alerting โข Performance tracking and slow query detection โข Resource usage monitoring Automation โข Automated backups โข Automated database optimization โข Automated image compression โข Automated security updates Infrastructure as Code โข Configs in Git โข Deployment scripts โข Environment variables โข Documented procedures Testing & Quality โข Automated test suite โข Pre-deployment health checks โข Staging environment โข Post-deployment verification KEY OPTIMIZATIONS Async Job Processing โข Upload endpoint: 8 seconds โ 340ms โข 4x capacity increase Database Optimization โข Feed loading: 6.4 seconds โ 280ms โข Strategic caching โข Batch processing Image Compression โข 3-8MB โ 180KB (94% reduction) โข Critical for mobile users Caching Strategy โข Redis for hot data โข Query result caching โข Smart invalidation Progressive Enhancement โข Server-rendered pages โข 2-3 second loads on 4G WHAT I'M WORRIED ABOUT FOR AZURE MIGRATION This is where I need your help: Architecture Decisions โข App Service vs Functions + managed services? โข MySQL vs Azure SQL? โข When does cost/benefit flip for managed services? Cost Management โข How do startups manage Azure costs during growth? โข Reserved instances vs pay-as-you-go? โข Which Azure services are worth the premium? Migration Strategy โข Lift-and-shift first, or re-architect immediately? โข Zero-downtime migration with 102K active users? โข Validation approach before full cutover? Monitoring & DevOps โข Application Insights - worth it from day one? โข Azure DevOps vs GitHub Actions for Azure deployments? โข Operational burden reduction with managed services? Development Workflow โข Local development against Azure services? โข Cost-effective staging environments? โข Testing Azure features without constant bills? MY PLANNED MIGRATION PATH Phase 1: Hybrid (Q1 2026) โข Azure CDN for static assets โข Azure Blob Storage for images โข Application Insights trial โข Keep compute on VPS Phase 2: Compute Migration (Q2 2026) โข App Service for API โข Azure Database for MySQL โข Azure Cache for Redis โข VPS for background jobs Phase 3: Full Azure (Q3 2026) โข Azure Functions for processing โข Full managed services โข Retire VPS QUESTIONS FOR THIS COMMUNITY Question 1: Am I making migration harder by waiting? Should I have started with Azure at higher cost to avoid technical debt? Question 2: What will break when I migrate? What works on VPS but fails in cloud? What assumptions won't hold? Question 3: How do I validate before cutting over? Parallel infrastructure? Gradual traffic shift? Safe patterns? Question 4: Cost optimization from day one? What to optimize immediately vs later? Common cost mistakes? Question 5: DevOps practices that transfer? What stays the same? What needs rethinking for cloud-native? THE BIGGER QUESTION Have you migrated from self-hosted to Azure? What surprised you? I know my setup isn't best practice by Azure standards. But it's working, and I've learned optimization, monitoring, and DevOps fundamentals in practice. Will those lessons transfer? Or am I building habits that cloud will expose as problematic? Looking forward to insights from folks who've made similar migrations. --- About the Author: Microsoft Certified DevOps Engineer and Azure Developer. CTO at social commerce platform scaling in West Africa. Preparing for phased Azure migration in 2026. P.S. I got the Azure certifications to prepare for this migration. Now I need real-world wisdom from people who've actually done it!120Views0likes0Comments๐ Strengthening Azure DNS Zone Security with RBAC and Resource Locks
๐ DNS security is more than just configuration itโs about protecting critical assets against unauthorized changes and accidental deletions. ๐ Managing DNS zones effectively requires a layered security approach. ๐ Two powerful mechanisms in Azure : Role-Based Access Control (RBAC) and Resource Locks ๐ Role-Based Access Control (RBAC) ๐ * Granular DNS Access Control * RBAC ensures controlled access management at both the DNS zone and record set levels. * Instead of assigning broad permissions, RBAC enables precise delegation using built-in roles such as: ๐น Owner โ Full control over the DNS zone, including configurations and deletions. ๐น Contributor โ Can modify DNS settings but cannot change access permissions. ๐น Network Contributor โ Can manage networking configurations related to DNS, but not modify records. ๐น DNS Zone Contributor โ Dedicated role for managing DNS zones without broader networking privileges. โ Key Advantages of RBAC in DNS Security: โ Prevent unauthorized modifications by restricting access to only necessary roles. โ Ensure operational integrity by limiting exposure to critical configurations. โ Improve governance by aligning roles with organizational security policies. ๐ Resource Locks ๐ * Guardrails for DNS Protection * Even with well-defined RBAC settings, accidental deletions can still occur. * Azure Resource Locks add an additional safeguard by preventing changes to a DNS zone or specific record sets. ๐น Zone Lock ----> Protects an entire DNS zone from being deleted, preserving all associated record sets. ๐น SOA Lock ----> Prevents unintentional zone deletions while allowing record modifications within the zone. โ How Resource Locks Enhance Security: โ Shields DNS zones from accidental or malicious deletions. โ Maintains continuity by ensuring record sets remain intact. โ Strengthens compliance controls for critical infrastructure. ๐ Best Practices for Securing DNS with RBAC & Resource Locks ๐ธ Assign least privilege rolesโnever give unnecessary access. ๐ธ Implement locks on essential zones to prevent configuration errors. ๐ธ Regularly audit access permissions using Azure Policy & Activity Logs. ๐ธ Use Automation & Alerts to track modifications for enhanced security. ๐น Implementing RBAC & Resource Locks ensures your cloud environment remains secure, operational, and fault-tolerant.557Views0likes1CommentRHEL In-place upgrades and Azure Update Manager
Following the process in this article will cause a disconnection between the data plane and the control plane of the virtual machine (VM). Azure capabilities such as Auto guest patching, Auto OS image upgrades, Hotpatching, and Azure Update Manager won't be available. To utilize these features, it's recommended to create a new VM using your preferred operating system instead of performing an in-place upgrade. According to https://learn.microsoft.com/en-us/azure/virtual-machines/workloads/redhat/redhat-in-place-upgrade, Azure Update Manager will break if any RHEL in-place upgrades are performed due to data/control plane disconnect. As a Microsoft product, this dilemma seems to defeat the benefits of AUM if you're someone like me who uses Redhat 'pet' VMs (as opposed to 'cattle' VMs) for work, and would frankly like to centralize all operations within the lifecycle of a Linux box inside the Azure tenant (patching, upgrading, rollback, any possible automation/application deployment etc). Unfortunately it would seem that this issue is largely something outside of the Azure customer's control. So, to anyone with esoteric Azure knowledge: what gives? Why and how is there a data disconnect between the control planes? What does the process look like from a bird's eye view? Given that the issue exists in the first place I would imagine that there is some kind of developmental contradiction, otherwise a feature like this probably would have been figured out a while ago (or that it is, as I suspect, simply not high priority enough despite a solution which may already exist in development). Furthermore, for those who may have more intimate info on the matter, does any sort of discussion or planning of a solution for this issue exist? With kindness, MadDogOfShimano292Views0likes2CommentsMine your Azure backup data, it could save you ๐ฐ๐ก
Your data has a story to tell. Mine it, decipher it, and turn it into actionable outcomes. ๐๐ Azure backups can become orphaned in several ways (I'll dive into that in a future post). But hereโs a key point: orphaned doesnโt always mean useless, hence the word โPotentialโ in the title of my Power BI report. Each workload needs to be assessed individually. If a backup is no longer needed, you might be paying for it - unnecessarily and unknowingly. ๐ต๏ธโโ๏ธ๐ธ To uncover these hidden costs, I combined data from the Azure Business Continuity Center with a PowerShell script I wrote to extract LastBackupTime and other metadata. This forms the foundation of my report, helping visualize and track backup usage over time. This approach helped me identify forgotten one-time backups, VMs deleted without stopping the backup, workloads excluded due to policy changes, and backups left behind after resource migrations. If you delete unneeded backups and have soft-delete enabled, the backup size drops to zero and Azure stops charging for it. โ ๐งน ๐ก Do your Azure backups have their own untold story to tell? ๐ธ Here's a snapshot of my report that helped me uncover these insights ๐108Views0likes0Comments