devops
420 TopicsLegacy SSRS reports after upgrading Azure DevOps Server 2020 to 2022 or 25H2
We are currently planning an upgrade from Azure DevOps Server 2020 to Azure DevOps Server 2022 or 25H2, and one of our biggest concerns is reporting. We understand that Microsoft’s recommended direction is to move to Power BI based on Analytics / OData. However, for on-prem environments with a large number of existing SSRS reports, rebuilding everything from scratch would require significant time and effort. Since Warehouse and Analysis Services are no longer available in newer versions, we would like to understand how other on-prem teams are handling legacy SSRS reporting during and after the upgrade. Have you rebuilt your reports in Power BI, moved to another reporting approach, or found a practical way to keep existing SSRS reports available during the transition? Any real-world experience, lessons learned, or recommended approaches would be greatly appreciated.144Views0likes3CommentsDevOps REST API - identity picker (query on list of users)
I'm using DevOps REST API via OAuth 2.0 to populate the fields of work item types. For "identity" fields, such as "System.AssignedTo", I'm having a hard time trying to figure out the best API that allows to retrieve a searchable list of users that mirrors what users see on DevOps website. From the browser inspector I saw the website calls this API, which is not documented: [POST] https://dev.azure.com/MY_ORGANIZATION/_apis/IdentityPicker/Identities as also noted on another discussion. But when I call this API (with the very same request body) from my local server I get a 401 response status code, and HTML content instead of the anticipated JSON. In Microsoft Entra Admin Center, I made sure to include "vso.identity" API permission for my app registration. What am I missing here? If I cannot use this API, what's the best alternative? I saw the https://learn.microsoft.com/en-us/rest/api/azure/devops/ims/identities/read-identities?view=azure-devops-rest-7.0&tabs=HTTP, but when I try to load it on the browser I always get zero results. E.g. https://vssps.dev.azure.com/MY_ORGANIZATION/_apis/identities?api-version=7.0&searchFilter=DisplayName&filterValue=SEARCH_TERM { "count": 0, "value": [] } Also, all the REST APIs I used so far are on https://dev.azure.com. How is https://vssps.dev.azure.com any different? Can I call APIs on a different host with the same OAuth access token?436Views0likes1CommentAzure DevOps REST API - Obtain all Build Policies runs of a Pull Request.
I have recently started using the Azure DevOps REST API to obtain some information in order to store it and later use it. The problem I have encountered is that I can't seem to find an easy way to obtain all of the build policies runs that have been requested for a Pull Request (just the build policies that are builds or pipelines). My understanding is that I can obtain the latest run of the build policies for a pull request, how ever I am interested in finding all, not just the most recent one. The only way I found to obtain this is by first finding out all build policies a pull request must run, and then for each of them find out all of their runs (by using their 'definition'), and then filtering to find just the ones associated to the pull request I want. My question is, is there an easier way to do this? Or is this the only way?77Views0likes1CommentHow to recover global admin access to tenant
I have already tried posting this to the general Microsoft Q&A forums and received no response. We are desperate to figure something out so if this is not the correct line of communication, please direct me to where I should go. My company is in a bit of a bind right now, and I am at my wit's end after almost a week of trying to get in contact with anyone who could help. We have multiple directories in Azure that belong to us, but they are all independent of each other. As such, some directories have multiple global admins (and thus are not an issue); others -- and quite frankly, the most important ones -- only have one global admin, and it was our DevOps person, who is no longer employed with us. We have no way of accessing his account, and thus no way of accessing a global admin account for these directories/tenants. Access to these directories is critical to our operations. We were informed last Friday by someone from the data protection team that they could not give us access to these tenants we pay thousands of dollars a month for because: Our former DevOps person registered all other users as guests/external users, and DPT "can't give external users admin permissions", and To reset the MFA of the current global admin account, the owner of the account (who no longer works for our company) would need to contact them and verify their identity What options do we have here? We have blobs full of user-uploaded files in these tenants. Starting over from scratch is a doomsday scenario we are trying everything we can to avoid. Surely there has to be something that can be done?41Views0likes2CommentsApplying DevOps Principles on Lean Infrastructure. Lessons From Scaling to 102K Users.
Hi Azure Community, I'm a Microsoft Certified DevOps Engineer, and I want to share an unusual journey. I have been applying DevOps principles on traditional VPS infrastructure to scale to 102,000 users with 99.2% uptime. Why am I posting this in an Azure community? Because I'm planning migration to Azure in 2026, and I want to understand: What mistakes am I already making that will bite me during migration? THE CURRENT SETUP Platform: Social commerce (West Africa) Users: 102,000 active Monthly events: 2 million Uptime: 99.2% Infrastructure: Single VPS Stack: PHP/Laravel, MySQL, Redis Yes - one VPS. No cloud. No Kubernetes. No microservices. WHY I HAVEN'T USED AZURE YET Honest answer: Budget constraints in emerging market startup ecosystem. At our current scale, fully managed Azure services would significantly increase monthly burn before product-market expansion. The funding we raised needs to last through growth milestones. The trade: I manually optimize what Azure would auto-scale. I debug what Application Insights would catch. I do by hand what Azure Functions would automate. DEVOPS PRACTICES THAT KEPT US RUNNING Even on single-server infrastructure, core DevOps principles still apply: CI/CD Pipeline (GitHub Actions) • 3-5 deployments weekly • Zero-downtime deploys • Automated rollback on health check failures • Feature flags for gradual rollouts Monitoring & Observability • Custom monitoring (would love Application Insights) • Real-time alerting • Performance tracking and slow query detection • Resource usage monitoring Automation • Automated backups • Automated database optimization • Automated image compression • Automated security updates Infrastructure as Code • Configs in Git • Deployment scripts • Environment variables • Documented procedures Testing & Quality • Automated test suite • Pre-deployment health checks • Staging environment • Post-deployment verification KEY OPTIMIZATIONS Async Job Processing • Upload endpoint: 8 seconds → 340ms • 4x capacity increase Database Optimization • Feed loading: 6.4 seconds → 280ms • Strategic caching • Batch processing Image Compression • 3-8MB → 180KB (94% reduction) • Critical for mobile users Caching Strategy • Redis for hot data • Query result caching • Smart invalidation Progressive Enhancement • Server-rendered pages • 2-3 second loads on 4G WHAT I'M WORRIED ABOUT FOR AZURE MIGRATION This is where I need your help: Architecture Decisions • App Service vs Functions + managed services? • MySQL vs Azure SQL? • When does cost/benefit flip for managed services? Cost Management • How do startups manage Azure costs during growth? • Reserved instances vs pay-as-you-go? • Which Azure services are worth the premium? Migration Strategy • Lift-and-shift first, or re-architect immediately? • Zero-downtime migration with 102K active users? • Validation approach before full cutover? Monitoring & DevOps • Application Insights - worth it from day one? • Azure DevOps vs GitHub Actions for Azure deployments? • Operational burden reduction with managed services? Development Workflow • Local development against Azure services? • Cost-effective staging environments? • Testing Azure features without constant bills? MY PLANNED MIGRATION PATH Phase 1: Hybrid (Q1 2026) • Azure CDN for static assets • Azure Blob Storage for images • Application Insights trial • Keep compute on VPS Phase 2: Compute Migration (Q2 2026) • App Service for API • Azure Database for MySQL • Azure Cache for Redis • VPS for background jobs Phase 3: Full Azure (Q3 2026) • Azure Functions for processing • Full managed services • Retire VPS QUESTIONS FOR THIS COMMUNITY Question 1: Am I making migration harder by waiting? Should I have started with Azure at higher cost to avoid technical debt? Question 2: What will break when I migrate? What works on VPS but fails in cloud? What assumptions won't hold? Question 3: How do I validate before cutting over? Parallel infrastructure? Gradual traffic shift? Safe patterns? Question 4: Cost optimization from day one? What to optimize immediately vs later? Common cost mistakes? Question 5: DevOps practices that transfer? What stays the same? What needs rethinking for cloud-native? THE BIGGER QUESTION Have you migrated from self-hosted to Azure? What surprised you? I know my setup isn't best practice by Azure standards. But it's working, and I've learned optimization, monitoring, and DevOps fundamentals in practice. Will those lessons transfer? Or am I building habits that cloud will expose as problematic? Looking forward to insights from folks who've made similar migrations. --- About the Author: Microsoft Certified DevOps Engineer and Azure Developer. CTO at social commerce platform scaling in West Africa. Preparing for phased Azure migration in 2026. P.S. I got the Azure certifications to prepare for this migration. Now I need real-world wisdom from people who've actually done it!146Views0likes1CommentHow to Fix Azure Event Grid Entra Authentication issue for ACS and Dynamics 365 integrated Webhooks
Introduction: Azure Event Grid is a powerful event routing service that enables event-driven architectures in Azure. When delivering events to webhook endpoints, security becomes paramount. Microsoft provides a secure webhook delivery mechanism using Microsoft Entra ID (formerly Azure Active Directory) authentication through the AzureEventGridSecureWebhookSubscriber role. Problem Statement: When integrating Azure Communication Services with Dynamics 365 Contact Center using Microsoft Entra ID-authenticated Event Grid webhooks, the Event Grid subscription deployment fails with an error: "HTTP POST request failed with unknown error code" with empty HTTP status and code. For example: Important Note: Before moving forward, please verify that you have the Owner role assigned on app to create event subscription. Refer to the Microsoft guidelines below to validate the required prerequisites before proceeding: Set up incoming calls, call recording, and SMS services | Microsoft Learn Why This Happens: This happens because AzureEventGridSecureWebhookSubscriber role is NOT properly configured on Microsoft EventGrid SP (Service Principal) and event subscription entra ID or application who is trying to create event grid subscription. What is AzureEventGridSecureWebhookSubscriber Role: The AzureEventGridSecureWebhookSubscriber is an Azure Entra application role that: Enables your application to verify the identity of event senders Allows specific users/applications to create event subscriptions Authorizes Event Grid to deliver events to your webhook How It Works: Role Creation: You create this app role in your destination webhook application's Azure Entra registration Role Assignment: You assign this role to: Microsoft Event Grid service principal (so it can deliver events) Either Entra ID / Entra User or Event subscription creator applications (so they can create event grid subscriptions) Token Validation: When Event Grid delivers events, it includes an Azure Entra token with this role claim Authorization Check: Your webhook validates the token and checks for the role Key Participants: Webhook Application (Your App) Purpose: Receives and processes events App Registration: Created in Azure Entra Contains: The AzureEventGridSecureWebhookSubscriber app role Validates: Incoming tokens from Event Grid Microsoft Event Grid Service Principal Purpose: Delivers events to webhooks App ID: Different per Azure cloud (Public, Government, etc.) Public Azure: 4962773b-9cdb-44cf-a8bf-237846a00ab7 Needs: AzureEventGridSecureWebhookSubscriber role assigned Event Subscription Creator Entra or Application Purpose: Creates event subscriptions Could be: You, Your deployment pipeline, admin tool, or another application Needs: AzureEventGridSecureWebhookSubscriber role assigned Although the full PowerShell script is documented in the below Event Grid documentation, it may be complex to interpret and troubleshoot. Azure PowerShell - Secure WebHook delivery with Microsoft Entra Application in Azure Event Grid - Azure Event Grid | Microsoft Learn To improve accessibility, the following section provides a simplified step-by-step tested solution along with verification steps suitable for all users including non-technical: Steps: STEP 1: Verify/Create Microsoft.EventGrid Service Principal Azure Portal → Microsoft Entra ID → Enterprise applications Change filter to Application type: Microsoft Applications Search for: Microsoft.EventGrid Ideally, your Azure subscription should include this application ID, which is common across all Azure subscriptions: 4962773b-9cdb-44cf-a8bf-237846a00ab7. If this application ID is not present, please contact your Azure Cloud Administrator. STEP 2: Create the App Role "AzureEventGridSecureWebhookSubscriber" Using Azure Portal: Navigate to your Webhook App Registration: Azure Portal → Microsoft Entra ID → App registrations Click All applications Find your app by searching OR use the Object ID you have Click on your app Create the App Role: Display name: AzureEventGridSecureWebhookSubscriber Allowed member types: Both (Users/Groups + Applications) Value: AzureEventGridSecureWebhookSubscriber Description: Azure Event Grid Role Do you want to enable this app role?: Yes In left menu, click App roles Click + Create app role Fill in the form: Click Apply STEP 3: Assign YOUR USER to the Role Using Azure Portal: Switch to Enterprise Application view: Azure Portal → Microsoft Entra ID → Enterprise applications Search for your webhook app (by name) Click on it Assign yourself: In left menu, click Users and groups Click + Add user/group Under Users, click None Selected Search for your user account (use your email) Select yourself Click Select Under Select a role, click None Selected Select AzureEventGridSecureWebhookSubscriber Click Select Click Assign STEP 4: Assign Microsoft.EventGrid Service Principal to the Role This step MUST be done via PowerShell or Azure CLI (Portal doesn't support this directly as we have seen) so PowerShell is recommended You will need to execute this step with the help of your Entra admin. # Connect to Microsoft Graph Connect-MgGraph -Scopes "AppRoleAssignment.ReadWrite.All" # Replace this with your webhook app's Application (client) ID $webhookAppId = "YOUR-WEBHOOK-APP-ID-HERE" #starting with c5 # Get your webhook app's service principal $webhookSP = Get-MgServicePrincipal -Filter "appId eq '$webhookAppId'" Write-Host " Found webhook app: $($webhookSP.DisplayName)" # Get Event Grid service principal $eventGridSP = Get-MgServicePrincipal -Filter "appId eq '4962773b-9cdb-44cf-a8bf-237846a00ab7'" Write-Host " Found Event Grid service principal" # Get the app role $appRole = $webhookSP.AppRoles | Where-Object {$_.Value -eq "AzureEventGridSecureWebhookSubscriber"} Write-Host " Found app role: $($appRole.DisplayName)" # Create the assignment New-MgServicePrincipalAppRoleAssignment ` -ServicePrincipalId $eventGridSP.Id ` -PrincipalId $eventGridSP.Id ` -ResourceId $webhookSP.Id ` -AppRoleId $appRole.Id Write-Host "Successfully assigned Event Grid to your webhook app!" Verification Steps: Verify the App Role was created: Your App Registration → App roles You should see: AzureEventGridSecureWebhookSubscriber Verify your user assignment: Enterprise application (your webhook app) → Users and groups You should see your user with role AzureEventGridSecureWebhookSubscriber Verify Event Grid assignment: Same location → Users and groups You should see Microsoft.EventGrid with role AzureEventGridSecureWebhookSubscriber Sample Flow: Analogy For Simplification: Lets think it similar to the construction site bulding where you are the owner of the building. Building = Azure Entra app (webhook app) Building (Azure Entra App Registration for Webhook) ├─ Building Name: "MyWebhook-App" ├─ Building Address: Application ID ├─ Building Owner: You ├─ Security System: App Roles (the security badges you create) └─ Security Team: Azure Entra and your actual webhook auth code (which validates tokens) like doorman Step 1: Creat the badge (App role) You (the building owner) create a special badge: - Badge name: "AzureEventGridSecureWebhookSubscriber" - Badge color: Let's say it's GOLD - Who can have it: Companies (Applications) and People (Users) This badge is stored in your building's system (Webhook App Registration) Step 2: Give badge to the Event Grid Service: Event Grid: "Hey, I need to deliver messages to your building" You: "Okay, here's a GOLD badge for your SP" Event Grid: *wears the badge* Now Event Grid can: - Show the badge to Azure Entra - Get tokens that say "I have the GOLD badge" - Deliver messages to your webhook Step 3: Give badge to yourself (or your deployment tool) You also need a GOLD badge because: - You want to create event grid event subscriptions - Entra checks: "Does this person have a GOLD badge?" - If yes: You can create subscriptions - If no: "Access denied" Your deployment pipeline also gets a GOLD badge: - So it can automatically set up event subscriptions during CI/CD deployments Disclaimer: The sample scripts provided in this article are provided AS IS without warranty of any kind. The author is not responsible for any issues, damages, or problems that may arise from using these scripts. Users should thoroughly test any implementation in their environment before deploying to production. Azure services and APIs may change over time, which could affect the functionality of the provided scripts. Always refer to the latest Azure documentation for the most up-to-date information. Thanks for reading this blog! I hope you found it helpful and informative for this specific integration use case 😀464Views4likes1CommentBuilding Multi-Agent Orchestration Using Microsoft Semantic Kernel: A Complete Step-by-Step Guide
What You Will Build By the end of this guide, you will have a working multi-agent system where 4 specialist AI agents collaborate to diagnose production issues: ClientAnalyst — Analyzes browser, JavaScript, CORS, uploads, and UI symptoms NetworkAnalyst — Analyzes DNS, TCP/IP, TLS, load balancers, and firewalls ServerAnalyst — Analyzes backend logs, database, deployments, and resource limits Coordinator — Synthesizes all findings into a root cause report with a prioritized action plan These agents don't just run in sequence — they debate, cross-examine, and challenge each other's findings through a shared conversation, producing a diagnosis that's better than any single agent could achieve alone. Table of Contents Why Multi-Agent? The Problem with Single Agents Architecture Overview Understanding the Key SK Components The Actor Model — How InProcessRuntime Works Setting Up Your Development Environment Step-by-Step: Building the Multi-Agent Analyzer The Agent Interaction Flow — Round by Round Bugs I Found & Fixed — Lessons Learned Running with Different AI Providers What to Build Next 1. Why Multi-Agent? The Problem with Single Agents A single AI agent analyzing a production issue is like having one doctor diagnose everything — they'll catch issues in their specialty but miss cross-domain connections. Consider this problem: "Users report 504 Gateway Timeout errors when uploading files larger than 10MB. Started after Friday's deployment. Worse during peak hours." A single agent might say "it's a server timeout" and stop. But the real root cause often spans multiple layers: The client is sending chunked uploads with an incorrect Content-Length header (client-side bug) The load balancer has a 30-second timeout that's too short for large uploads (network config) The server recently deployed a new request body parser that's 3x slower (server-side regression) The combination only fails during peak hours because connection pool saturation amplifies the latency No single perspective catches this. You need specialists who analyze independently, then debate to find the cross-layer causal chain. That's what multi-agent orchestration gives you. The 5 Orchestration Patterns in SK Semantic Kernel provides 5 built-in patterns for agent collaboration: SEQUENTIAL: A → B → C → Done (pipeline — each builds on previous) CONCURRENT: ↗ A ↘ Task → B → Aggregate ↘ C ↗ (parallel — results merged) GROUP CHAT: A ↔ B ↔ C ↔ D ← We use this one (rounds, shared history, debate) HANDOFF: A → (stuck?) → B → (complex?) → Human (escalation with human-in-the-loop) MAGENTIC: LLM picks who speaks next dynamically (AI-driven speaker selection) We use GroupChatOrchestration with RoundRobinGroupChatManager because our problem requires agents to see each other's work, challenge assumptions, and build on each other's analysis across two rounds. 2. Architecture Overview Here's the complete architecture of what we're building: 3. Understanding the Key SK Components Before we write code, let's understand the 5 components we'll use and the design pattern each implements: ChatCompletionAgent — Strategy Pattern The agent definition. Each agent is a combination of: name — unique identifier (used in round-robin ordering) instructions — the persona and rules (this is the prompt engineering) service — which AI provider to call (Strategy Pattern — swap providers without changing agent logic) description — what other agents/tools understand about this agent agent = ChatCompletionAgent( name="ClientAnalyst", instructions="You are ONLY ClientAnalyst...", service=gemini_service, # ← Strategy: swap to OpenAI with zero changes description="Analyzes client-side issues", ) GroupChatOrchestration — Mediator Pattern The orchestration defines HOW agents interact. It's the Mediator — agents don't talk to each other directly. Instead, the orchestration manages a shared ChatHistory and routes messages through the Manager. RoundRobinGroupChatManager — Strategy Pattern The Manager decides WHO speaks next. RoundRobinGroupChatManager cycles through agents in a fixed order. SK also provides AutomaticGroupChatManager where the LLM decides who speaks next. max_rounds is the total number of messages per agent or cycle. With 4 agents and max_rounds=8, each agent speaks exactly twice. InProcessRuntime — Actor Model Abstraction The execution engine. Every agent becomes an "actor" with its own kind of mailbox (message queue). The runtime delivers messages between actors. Key properties: No shared state — agents communicate only through messages Sequential processing — each agent processes one message at a time Location transparency — same code works in-process today, distributed tomorrow agent_response_callback — Observer Pattern A function that fires after EVERY agent response. We use it to display each agent's output in real-time with emoji labels and round numbers. 4. The Actor Model — How InProcessRuntime Works The Actor Model is a concurrency pattern where each entity is an isolated "actor" with a private mailbox. Here's what happens inside InProcessRuntime when we run our demo: runtime.start() │ ├── Creates internal message loop (asyncio event loop) │ orchestration.invoke(task="504 timeout...", runtime=runtime) │ ├── Creates Actor[Orchestrator] → manages overall flow ├── Creates Actor[Manager] → RoundRobinGroupChatManager ├── Creates Actor[ClientAnalyst] → mailbox created, waiting ├── Creates Actor[NetworkAnalyst] → mailbox created, waiting ├── Creates Actor[ServerAnalyst] → mailbox created, waiting └── Creates Actor[Coordinator] → mailbox created, waiting Manager receives "start" message │ ├── Checks turn order: [Client, Network, Server, Coordinator] ├── Sends task to ClientAnalyst mailbox │ → ClientAnalyst processes: calls LLM → response │ → Response added to shared ChatHistory │ → callback fires (displayed in Notebook UI) │ → Sends "done" back to Manager │ ├── Manager updates: turn_index=1 ├── Sends to NetworkAnalyst mailbox │ → Same flow... │ ├── ... (ServerAnalyst, Coordinator for Round 1) │ ├── Manager checks: messages=4, max_rounds=8 → continue │ ├── Round 2: same cycle with cross-examination │ └── After message 8: Manager sends "complete" → OrchestrationResult resolves → result.get() returns final answer runtime.stop_when_idle() → All mailboxes empty → clean shutdown The Actor Model guarantees: No race conditions (each actor processes one message at a time) No deadlocks (no shared locks to contend for) No shared mutable state (agents communicate only via messages) 5. Setting Up Your Development Environment Prerequisites Python 3.11 or 3.12 (3.13+ may have compatibility issues with some SK connectors) Visual Studio Code with the Python and Jupyter extensions An API key from one of: Google AI Studio (free), OpenAI Step 1: Install Python Download from python.org. During installation, check "Add Python to PATH". Verify: python --version # Python 3.12.x Step 2: Install VS Code Extensions Open VS Code, go to Extensions (Ctrl+Shift+X), and install: Python (by Microsoft) — Python language support Jupyter (by Microsoft) — Notebook support Pylance (by Microsoft) — IntelliSense and type checking Step 3: Create Project Folder mkdir sk-multiagent-demo cd sk-multiagent-demo Open in VS Code: code . Step 4: Create Virtual Environment Open the VS Code terminal (Ctrl+`) and run: # Create virtual environment python -m venv sk-env # Activate it # Windows: sk-env\Scripts\activate # macOS/Linux: source sk-env/bin/activate You should see (sk-env) in your terminal prompt. Step 5: Install Semantic Kernel For Google Gemini (free tier — recommended for getting started): pip install semantic-kernel[google] python-dotenv ipykernel For OpenAI (paid API key): pip install semantic-kernel openai python-dotenv ipykernel For Azure AI Foundry (enterprise, Entra ID auth): pip install semantic-kernel azure-identity python-dotenv ipykernel Step 6: Register the Jupyter Kernel python -m ipykernel install --user --name=sk-env --display-name="Semantic Kernel (Python 3.12)" You can also select if this is already available from your environment from VSCode as below: Step 7: Get Your API Key Option A — Google Gemini (FREE, recommended for demo): Go to https://aistudio.google.com/apikey Click "Create API Key" Copy the key Free tier limits: 15 requests/minute, 1 million tokens/minute — more than enough for this demo. Option B — OpenAI: Go to https://platform.openai.com/api-keys Create a new key Copy the key Option C — Azure AI Foundry: Deploy a model in Azure AI Foundry portal Note the endpoint URL and deployment name If key-based auth is disabled, you'll need Entra ID with permissions Step 8: Create the .env File In your project root, create a file named .env: For Gemini: GOOGLE_AI_API_KEY=AIzaSy...your-key-here GOOGLE_AI_GEMINI_MODEL_ID=gemini-2.5-flash For OpenAI: OPENAI_API_KEY=sk-...your-key-here OPENAI_CHAT_MODEL_ID=gpt-4o For Azure AI Foundry: AZURE_OPENAI_ENDPOINT=https://your-resource.cognitiveservices.azure.com AZURE_OPENAI_CHAT_DEPLOYMENT_NAME=gpt-4o AZURE_OPENAI_API_KEY=your-key Step 9: Create the Notebook In VS Code: Click File > New File Save as multi_agent_analyzer.ipynb In the top-right of the notebook, click Select Kernel Choose Semantic Kernel (Python 3.12) (or your sk-env) Your environment is ready. Let's build. 6. Step-by-Step: Building the Multi-Agent Analyzer Cell 1: Verify Setup import semantic_kernel print(f"Semantic Kernel version: {semantic_kernel.__version__}") from semantic_kernel.agents import ( ChatCompletionAgent, GroupChatOrchestration, RoundRobinGroupChatManager, ) from semantic_kernel.agents.runtime import InProcessRuntime from semantic_kernel.contents import ChatMessageContent print("All imports successful") Cell 2: Load API Key and Create Service For Gemini: import os from dotenv import load_dotenv load_dotenv() from semantic_kernel.connectors.ai.google.google_ai import ( GoogleAIChatCompletion, GoogleAIChatPromptExecutionSettings, ) from semantic_kernel.contents import ChatHistory GEMINI_API_KEY = os.getenv("GOOGLE_AI_API_KEY") GEMINI_MODEL = os.getenv("GOOGLE_AI_GEMINI_MODEL_ID", "gemini-2.5-flash") service = GoogleAIChatCompletion( gemini_model_id=GEMINI_MODEL, api_key=GEMINI_API_KEY, ) print(f"Service created: Gemini {GEMINI_MODEL}") # Smoke test settings = GoogleAIChatPromptExecutionSettings() test_history = ChatHistory(system_message="You are a helpful assistant.") test_history.add_user_message("Say 'Connected!' and nothing else.") response = await service.get_chat_message_content( chat_history=test_history, settings=settings ) print(f"Model says: {response.content}") For OpenAI: import os from dotenv import load_dotenv load_dotenv() from semantic_kernel.connectors.ai.open_ai import ( OpenAIChatCompletion, OpenAIChatPromptExecutionSettings, ) from semantic_kernel.contents import ChatHistory service = OpenAIChatCompletion( ai_model_id=os.getenv("OPENAI_CHAT_MODEL_ID", "gpt-4o"), ) print(f"Service created: OpenAI {os.getenv('OPENAI_CHAT_MODEL_ID', 'gpt-4o')}") # Smoke test settings = OpenAIChatPromptExecutionSettings() test_history = ChatHistory(system_message="You are a helpful assistant.") test_history.add_user_message("Say 'Connected!' and nothing else.") response = await service.get_chat_message_content( chat_history=test_history, settings=settings ) print(f"Model says: {response.content}") Cell 3: Define All 4 Agents This is the most important cell — the prompt engineering that makes the demo work: from semantic_kernel.agents import ChatCompletionAgent # ═══════════════════════════════════════════════════ # AGENT 1: Client-Side Analyst # ═══════════════════════════════════════════════════ client_agent = ChatCompletionAgent( name="ClientAnalyst", description="Analyzes problems from the client-side: browser, JS, CORS, caching, UI symptoms", instructions="""You are ONLY **ClientAnalyst**. You must NEVER speak as NetworkAnalyst, ServerAnalyst, or Coordinator. Every word you write is from ClientAnalyst's perspective only. You are a senior front-end and client-side diagnostics expert. When given a problem statement, analyze it EXCLUSIVELY from the client side: 1. **Browser & Rendering**: DOM issues, JavaScript errors, CSS rendering, browser compatibility, memory leaks, console errors. 2. **Client-Side Caching**: Stale cache, service worker issues, local storage corruption. 3. **Network from Client View**: CORS errors, preflight failures, request timeouts, client-side retry storms, fetch/XHR configuration. 4. **Upload Handling**: File API usage, chunk upload implementation, progress tracking, FormData construction, content-type headers. 5. **UI/UX Symptoms**: What the user sees, error messages displayed, loading states. ROUND 1: Provide your independent analysis. Do NOT reference other agents. List your top 3 most likely causes with evidence. Every response MUST be at least 200 words. ROUND 2: You MUST: - Reference NetworkAnalyst and ServerAnalyst BY NAME - State specifically where you AGREE or DISAGREE with their findings - Answer the Coordinator's questions from your perspective - Add NEW cross-layer insights you see from the client perspective - Do NOT just say 'I agree' — provide substantive technical reasoning Be specific, evidence-based, and prioritize findings by likelihood.""", service=service, ) # ═══════════════════════════════════════════════════ # AGENT 2: Network Analyst # ═══════════════════════════════════════════════════ network_agent = ChatCompletionAgent( name="NetworkAnalyst", description="Analyzes problems from the network side: DNS, TCP, TLS, firewalls, load balancers, latency", instructions="""You are ONLY **NetworkAnalyst**. You must NEVER speak as ClientAnalyst, ServerAnalyst, or Coordinator. Every word you write is from NetworkAnalyst's perspective only. You are a senior network infrastructure diagnostics expert. When given a problem statement, analyze it EXCLUSIVELY from the network layer: 1. **DNS & Resolution**: DNS TTL, propagation delays, record misconfigurations. 2. **TCP/IP & Connections**: Connection pooling, keep-alive, TCP window scaling, connection resets, SYN floods. 3. **TLS/SSL**: Certificate issues, handshake failures, protocol version mismatches. 4. **Load Balancers & Proxies**: Sticky sessions, health checks, timeout configs, request body size limits, proxy buffering. 5. **Firewall & WAF**: Rule blocks, rate limiting, request inspection delays, geo-blocking, DDoS protection interference. ROUND 1: Provide your independent analysis. Do NOT reference other agents. List your top 3 most likely causes with evidence. Every response MUST be at least 200 words. ROUND 2: You MUST: - Reference ClientAnalyst and ServerAnalyst BY NAME - State specifically where you AGREE or DISAGREE with their findings - Answer the Coordinator's questions from your perspective - Add NEW cross-layer insights you see from the network perspective - Do NOT just say 'I am ready to proceed' — provide substantive technical analysis Be specific, evidence-based, and prioritize findings by likelihood.""", service=service, ) # ═══════════════════════════════════════════════════ # AGENT 3: Server-Side Analyst # ═══════════════════════════════════════════════════ server_agent = ChatCompletionAgent( name="ServerAnalyst", description="Analyzes problems from the server side: backend app, database, logs, resources, deployments", instructions="""You are ONLY **ServerAnalyst**. You must NEVER speak as ClientAnalyst, NetworkAnalyst, or Coordinator. Every word you write is from ServerAnalyst's perspective only. You are a senior backend and infrastructure diagnostics expert. When given a problem statement, analyze it EXCLUSIVELY from the server side: 1. **Application Server**: Error logs, exception traces, thread pool exhaustion, memory leaks, CPU spikes, garbage collection pauses. 2. **Database**: Slow queries, connection pool saturation, lock contention, deadlocks, replication lag, query plan changes. 3. **Deployment & Config**: Recent deployments, configuration changes, feature flags, environment variable mismatches, rollback candidates. 4. **Resource Limits**: File upload size limits, request body limits, disk space, temporary file cleanup, storage quotas. 5. **External Dependencies**: Upstream API timeouts, third-party service degradation, queue backlogs, cache (Redis/Memcached) issues. ROUND 1: Provide your independent analysis. Do NOT reference other agents. List your top 3 most likely causes with evidence. Every response MUST be at least 200 words. ROUND 2: You MUST: - Reference ClientAnalyst and NetworkAnalyst BY NAME - State specifically where you AGREE or DISAGREE with their findings - Answer the Coordinator's questions from your perspective - Add NEW cross-layer insights you see from the server perspective - Do NOT just say 'I agree' — provide substantive technical reasoning Be specific, evidence-based, and prioritize findings by likelihood.""", service=service, ) # ═══════════════════════════════════════════════════ # AGENT 4: Coordinator # ═══════════════════════════════════════════════════ coordinator_agent = ChatCompletionAgent( name="Coordinator", description="Synthesizes all specialist analyses into a final root cause report with prioritized action plan", instructions="""You are ONLY **Coordinator**. You must NEVER speak as ClientAnalyst, NetworkAnalyst, or ServerAnalyst. You synthesize — you do NOT do domain-specific analysis. You are the lead engineer who synthesizes the team's findings. ═══ ROUND 1 BEHAVIOR (your first turn, message 4) ═══ Keep this SHORT — maximum 300 words. - Note 2-3 KEY PATTERNS across the three analyses - Identify where specialists AGREE (high-confidence) - Identify where they CONTRADICT (needs resolution) - Ask 2-3 SPECIFIC QUESTIONS for Round 2 Round 1 MUST NOT: assign tasks, create action plans, write reports, or tell agents what to take lead on. Observation + questions ONLY. ═══ ROUND 2 BEHAVIOR (your final turn, message 8) ═══ Keep this FOCUSED — maximum 800 words. Produce a structured report: 1. **Root Cause** (1 paragraph): The #1 most likely cause with causal chain across layers. Reference specific findings from each specialist. 2. **Confidence** (short list): - HIGH: Areas where all 3 agreed - MEDIUM: Areas where 2 of 3 agreed - LOW: Disagreements needing investigation 3. **Action Plan** (numbered, max 6 items): For each: - What to do (specific) - Owner (Client/Network/Server team) - Time estimate 4. **Quick Wins vs Long-term** (2 short lists) Do NOT repeat what specialists already said verbatim. Synthesize, don't echo.""", service=service, ) # ═══════════════════════════════════════════════════ # All 4 agents — order = RoundRobin order # ═══════════════════════════════════════════════════ agents = [client_agent, network_agent, server_agent, coordinator_agent] print(f"{len(agents)} agents created:") for i, a in enumerate(agents, 1): print(f" {i}. {a.name}: {a.description[:60]}...") print(f"\nRoundRobin order: {' → '.join(a.name for a in agents)}") Cell 4: Run the Analysis from semantic_kernel.agents import GroupChatOrchestration, RoundRobinGroupChatManager from semantic_kernel.agents.runtime import InProcessRuntime from semantic_kernel.contents import ChatMessageContent from IPython.display import display, Markdown # ╔══════════════════════════════════════════════════════════╗ # ║ EDIT YOUR PROBLEM STATEMENT HERE ║ # ╚══════════════════════════════════════════════════════════╝ PROBLEM = """ Users are reporting intermittent 504 Gateway Timeout errors when trying to upload files larger than 10MB through our web application. The issue started after last Friday's deployment and seems worse during peak hours (2-5 PM EST). Some users also report that smaller file uploads work fine but the progress bar freezes at 85% for large files before timing out. """ # ════════════════════════════════════════════════════════════ agent_responses = [] def agent_response_callback(message: ChatMessageContent) -> None: name = message.name or "Unknown" content = message.content or "" agent_responses.append({"agent": name, "content": content}) emoji = { "ClientAnalyst": "🖥️", "NetworkAnalyst": "🌐", "ServerAnalyst": "⚙️", "Coordinator": "🎯" }.get(name, "🔹") round_num = (len(agent_responses) - 1) // len(agents) + 1 display(Markdown( f"---\n### {emoji} {name} (Message {len(agent_responses)}, Round {round_num})\n\n{content}" )) MAX_ROUNDS = 8 # 4 agents × 2 rounds = 8 messages exactly task = f"""## Problem Statement {PROBLEM.strip()} ## Discussion Rules You are in a GROUP DISCUSSION with 4 members. You can see ALL previous messages. There are exactly 2 rounds. ### ROUND 1 (Messages 1-4): Independent Analysis - ClientAnalyst, NetworkAnalyst, ServerAnalyst: Analyze from YOUR domain only. Give your top 3 most likely causes with evidence and reasoning. - Coordinator: Note patterns across the 3 analyses. Ask 2-3 specific questions. Do NOT assign tasks yet. ### ROUND 2 (Messages 5-8): Cross-Examination & Final Report - ClientAnalyst, NetworkAnalyst, ServerAnalyst: You MUST reference the OTHER specialists BY NAME. State where you agree, disagree, or have new insights. Answer the Coordinator's questions. Provide SUBSTANTIVE analysis. - Coordinator: Produce the FINAL structured report: root cause, confidence levels, prioritized action plan with owners and time estimates. IMPORTANT: Each agent speaks as THEMSELVES only. Never impersonate another agent.""" display(Markdown(f"## Problem Statement\n\n{PROBLEM.strip()}")) display(Markdown(f"---\n## Discussion Starting — {len(agents)} agents, {MAX_ROUNDS} rounds\n")) # Build and run orchestration = GroupChatOrchestration( members=agents, manager=RoundRobinGroupChatManager(max_rounds=MAX_ROUNDS), agent_response_callback=agent_response_callback, ) runtime = InProcessRuntime() runtime.start() result = await orchestration.invoke(task=task, runtime=runtime) final_result = await result.get(timeout=300) await runtime.stop_when_idle() display(Markdown(f"---\n## FINAL CONCLUSION\n\n{final_result}")) Cell 5: Statistics and Validation print("═" * 55) print(" ANALYSIS STATISTICS") print("═" * 55) emojis = {"ClientAnalyst": "🖥️", "NetworkAnalyst": "🌐", "ServerAnalyst": "⚙️", "Coordinator": "🎯"} agent_counts = {} agent_chars = {} for r in agent_responses: agent_counts[r["agent"]] = agent_counts.get(r["agent"], 0) + 1 agent_chars[r["agent"]] = agent_chars.get(r["agent"], 0) + len(r["content"]) for agent, count in agent_counts.items(): em = emojis.get(agent, "🔹") chars = agent_chars.get(agent, 0) avg = chars // count if count else 0 print(f" {em} {agent}: {count} msg(s), ~{chars:,} chars (avg {avg:,}/msg)") print(f"\n Total messages: {len(agent_responses)}") total_chars = sum(len(r['content']) for r in agent_responses) print(f" Total analysis: ~{total_chars:,} characters") # Validation print(f"\n Validation:") import re identity_issues = [] for r in agent_responses: other_agents = [a.name for a in agents if a.name != r["agent"]] for other in other_agents: pattern = rf'(?i)as {re.escape(other)}[,:]?\s+I\b' if re.search(pattern, r["content"][:300]): identity_issues.append(f"{r['agent']} impersonated {other}") if identity_issues: print(f" Identity confusion: {identity_issues}") else: print(f" No identity confusion detected") thin = [r for r in agent_responses if len(r["content"].strip()) < 100] if thin: for t in thin: print(f" Thin response from {t['agent']}") else: print(f" All responses are substantive") Cell 6: Save Report from datetime import datetime timestamp = datetime.now().strftime("%Y%m%d_%H%M%S") filename = f"analysis_report_{timestamp}.md" with open(filename, "w", encoding="utf-8") as f: f.write(f"# Problem Analysis Report\n\n") f.write(f"**Generated:** {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n") f.write(f"**Agents:** {', '.join(a.name for a in agents)}\n") f.write(f"**Rounds:** {MAX_ROUNDS}\n\n---\n\n") f.write(f"## Problem Statement\n\n{PROBLEM.strip()}\n\n---\n\n") for i, r in enumerate(agent_responses, 1): em = emojis.get(r['agent'], '🔹') round_num = (i - 1) // len(agents) + 1 f.write(f"### {em} {r['agent']} (Message {i}, Round {round_num})\n\n") f.write(f"{r['content']}\n\n---\n\n") f.write(f"## Final Conclusion\n\n{final_result}\n") print(f"Report saved to: {filename}") 7. The Agent Interaction Flow — Round by Round Here's what actually happens during the 8-message orchestration: Round 1: Independent Analysis (Messages 1-4) Msg Agent What They See What They Do 1 ClientAnalyst Problem statement only Analyzes from client perspective: upload chunking, progress bar freezing at 85%, CORS, content-type headers 2 NetworkAnalyst Problem + ClientAnalyst's analysis Gives INDEPENDENT analysis despite seeing msg 1: load balancer timeouts, proxy body size limits, TCP window scaling 3 ServerAnalyst Problem + msgs 1-2 Gives INDEPENDENT analysis: recent deployment regression, request body parser, thread pool exhaustion, disk space 4 Coordinator Problem + msgs 1-3 Observes patterns: "All three mention timeout configuration. ClientAnalyst and NetworkAnalyst both point to body size. Question: Was the deployment a backend-only change or did it include infra?" Round 2: Cross-Examination (Messages 5-8) Msg Agent What They Do 5 ClientAnalyst "I agree with NetworkAnalyst that the load balancer timeout is likely a factor — the 85% freeze point matches the 30-second LB timeout for a 10MB upload on our average upload speed. However, I disagree with ServerAnalyst about thread pool exhaustion because the UI shows a clean 504, not a connection reset." 6 NetworkAnalyst "ServerAnalyst's point about the recent deployment is critical — if the new request parser is 3x slower, that would push uploads past the LB timeout. I can confirm the LB has a 30s idle timeout. The fix is both: increase LB timeout AND optimize the parser." 7 ServerAnalyst "Responding to Coordinator's question: The deployment was backend-only — a new multipart parser using streaming instead of buffered reads. ClientAnalyst is correct that the 504 is from the LB, not the app. The app itself returns 200 after 45 seconds, but the LB kills the connection at 30." 8 Coordinator Produces final structured report with root cause: "The backend deployment introduced a slower multipart parser (45s vs 15s for 10MB). The load balancer's 30s timeout kills the connection at ~85% progress. Fix: immediate — increase LB timeout to 120s. Short-term — optimize parser. Long-term — implement chunked uploads with progress resumption." Notice: The Round 2 analysis is dramatically better than Round 1. Agents reference each other by name, build on each other's findings, and the Coordinator can synthesize a cross-layer causal chain that no single agent could have produced. I made a small adjustment to the issue with Azure Web Apps. Please find the details below from testing carried out using Google Gemini: 8. Bugs I Found & Fixed — Lessons Learned Building this demo taught me several important lessons about multi-agent systems: Bug 1: Agents Speaking Only Once Symptom: Only 4 messages instead of 8. Root cause: The agents list was missing the Coordinator. It was defined in a separate cell and wasn't included in the members list. Fix: All 4 agents must be in the same list passed to GroupChatOrchestration. Bug 2: NetworkAnalyst Says "I'm Ready to Proceed" Symptom: NetworkAnalyst's Round 2 response was just "I'm ready to proceed with the analysis" — no actual content. Root cause: The Coordinator's Round 1 message was assigning tasks ("NetworkAnalyst, please check the load balancer config"), and the agent was acknowledging the assignment instead of analyzing. Fix: Added explicit constraint to Coordinator: "Round 1 MUST NOT assign tasks — observation + questions ONLY." Bug 3: ServerAnalyst Says "As NetworkAnalyst, I..." Symptom: ServerAnalyst's response started with "As NetworkAnalyst, I believe..." Root cause: LLM identity bleeding. When agents share ChatHistory, the LLM sometimes loses track of which agent it's currently playing. This is especially common with Gemini. Fix: Identity anchoring at the very top of every agent's instructions: "You are ONLY ServerAnalyst. You must NEVER speak as ClientAnalyst, NetworkAnalyst, or Coordinator." Bug 4: Gemini Gives Thin/Empty Responses Symptom: Some agents responded with just one sentence or "I concur." Root cause: Gemini 2.5 Flash is more concise than GPT-4o by default. Without explicit length requirements, it takes shortcuts. Fix: Added "Every response MUST be at least 200 words" and "Answer the Coordinator's questions" to every specialist's instructions. Bug 5: Coordinator's Report is 18K Characters Symptom: The Coordinator's Round 2 response was absurdly long — repeating everything every specialist said. Fix: Added word limits: "Round 1 max 300 words, Round 2 max 800 words" and "Synthesize, don't echo." Bug 6: MAX_ROUNDS Math Symptom: With MAX_ROUNDS=9, ClientAnalyst spoke a 3rd time after the Coordinator's final report — breaking the clean 2-round structure. Fix: MAX_ROUNDS must equal (number of agents × number of rounds). For 4 agents × 2 rounds = 8. 9. Running with Different AI Providers The beauty of SK's Strategy Pattern is that you change ONE LINE to switch providers. Everything else — agents, orchestration, callbacks, validation — stays identical. Gemini setup: from semantic_kernel.connectors.ai.google.google_ai import GoogleAIChatCompletion service = GoogleAIChatCompletion( gemini_model_id="gemini-2.5-flash", api_key=os.getenv("GOOGLE_AI_API_KEY"), ) OpenAI Setup from semantic_kernel.connectors.ai.open_ai import OpenAIChatCompletion service = OpenAIChatCompletion( ai_model_id="gpt-4o", api_key=os.getenv("OPEN_AI_API_KEY"), ) 10. What to Build Next Add Plugins to Agents Give agents real tools — not just LLM reasoning - looks exciting right ;) class NetworkDiagnosticPlugin: (description="Pings a host and returns latency") def ping(self, host: str) -> str: result = subprocess.run(["ping", "-c", "3", host], capture_output=True, text=True) return result.stdout class LogSearchPlugin: (description="Searches server logs for error patterns") def search_logs(self, pattern: str, hours: int = 1) -> str: # Query your log aggregator (Splunk, ELK, Azure Monitor) return query_logs(pattern, hours) Add Filters for Governance Intercept every agent call for PII redaction and audit logging: .filter(filter_type=FilterTypes.FUNCTION_INVOCATION) async def audit_filter(context, next): print(f"[AUDIT] {context.function.name} called by agent") await next(context) print(f"[AUDIT] {context.function.name} returned") Try Different Orchestration Patterns Replace GroupChat with Sequential for a pipeline approach: # Instead of debate, each agent builds on the previous orchestration = SequentialOrchestration( members=[client_agent, network_agent, server_agent, coordinator_agent] ) Or Concurrent for parallel analysis: # All specialists analyze simultaneously, Coordinator aggregates orchestration = ConcurrentOrchestration( members=[client_agent, network_agent, server_agent] ) Deploy to Azure Move from InProcessRuntime to Azure Container Apps for production scaling. The agent code doesn't change — only the runtime. Summary The key insight from building this demo: multi-agent systems produce better results than single agents not because each agent is smarter, but because the debate structure forces cross-domain thinking that a single prompt can never achieve. The Coordinator's final report consistently identifies causal chains that span client, network, and server layers — exactly the kind of insight that production incident response teams need. Semantic Kernel makes this possible with clean separation of concerns: agents define WHAT to analyze, orchestration defines HOW they interact, the manager defines WHO speaks when, the runtime handles WHERE it executes, and callbacks let you OBSERVE everything. Each piece is independently swappable — that's the power of SK from Microsoft. Resources: GitHub: github.com/microsoft/semantic-kernel Docs: learn.microsoft.com/semantic-kernel Orchestration Patterns: learn.microsoft.com/semantic-kernel/frameworks/agent/agent-orchestration Discord: aka.ms/sk/discord Disclaimer: The sample scripts provided in this article are provided AS IS without warranty of any kind. The author is not responsible for any issues, damages, or problems that may arise from using these scripts. Users should thoroughly test any implementation in their environment before deploying to production. Azure services and APIs may change over time, which could affect the functionality of the provided scripts. Always refer to the latest Azure documentation for the most up-to-date information. Thanks for reading this blog! I hope you found it helpful and informative for building AI agents with SK (Semantic Kernel) 😀593Views3likes1CommentBicep Error While Integration API Management with Function App
Team, We are trying to automate the deployment of Azure Function App with Azure API Management so that when there is a schema change, we dont have to manually update the APIM with the functionApp schema changes. Our API Management and Function App are inside different resource groups, but inside the same subscription. I have a main.bicep file withe following relevant parts. main.bicep targetScope = 'resourceGroup' parameter definisions Modules defined inside the bicep file like this module functionapp 'modules/funcapp.bicep' = { name: 'funappDeploy' params: { tags: tags location: location application: application environment: environment instrumentationKey: appinsight.outputs.instrumentationKey storageAccountName: storageaccount.outputs.storageAccountName functionSubNetId: networking.outputs.functionSubnetId appInsightsConSting: appinsight.outputs.appInsightsConSting } } module apimanagement 'modules/apimanagement.bicep' = { name: 'apimdeploy' params: { application: application environment: environment } } API Management resource is already existing and API's including my relevant API is deployed inside. This was done early without automation. And here is my apimanagement.bicep targetScope = 'resourceGroup' param environment string param application string = 'gcob-ncino' var apimName = 'apim-rxx-${application}-${environment}' var resourceGroupName = 'rg-rxx-apim-${application}-${environment}' resource apiManagement 'Microsoft.ApiManagement/service@2024-05-01' existing = { name: apimName scope: resourceGroup(resourceGroupName) } resource apimApi 'Microsoft.ApiManagement/service/apis@2024-05-01' = { name: 'yourApimApiName' parent: apiManagement properties: { name: 'risk-rating' // This is the API name within APIM path: '/' // The base path for this API displayName: 'risk-rating' } } resource apimBackend 'Microsoft.ApiManagement/service/backends@2024-05-01' = { name: 'my-function-app' parent: apiManagement properties: { protocol: 'https' // Or https url: 'azfun-my-dev.azurewebsites.net' } } resource apimApiBackendLink 'Microsoft.ApiManagement/service/apis/backends@2021-08-01' = { name: 'my-function-app-link' parent: apimApi properties: { backendId: apimBackend.id } } While deploying through the Azure DevOps, I see the below error in the pipeline. A resource's computed scope must match that of the Bicep file for it to be deployable. This resource's scope is computed from the "scope" property value assigned to ancestor resource "apiManagement". You must use modules to deploy resources to a different scope. [https://aka.ms/bicep/core-diagnostics#BCP165] Can someone suggest, how to fix this problem? Thanks DD117Views0likes2Comments🚀 Git-Driven Deployments for Microsoft Fabric Using GitHub Actions
👋 Introduction If you've been working with Microsoft Fabric, you've likely faced this question: "How do we promote Fabric items from DEV → QA → PROD reliably, consistently, and with proper governance?" Many teams default to the built-in Fabric Deployment Pipelines — and they work great for simpler scenarios. But what happens when your enterprise demands: 🔒 Centralized governance across all platforms (infra, app, and data) 📜 Full audit trail of every change tied to a Git commit ✅ Approval gates with reviewer-based promotion 🔑 Per-environment service principal isolation 🧩 Alignment with your existing DevOps standards That's exactly the problem we set out to solve. In this post, I'll walk you through a production-ready, enterprise-grade CI/CD solution for Microsoft Fabric using the fabric-cicd Python library and GitHub Actions — with zero dependency on Fabric Deployment Pipelines. 🎯 What Problem Are We Solving? Traditional Fabric promotion workflows often look like this: Step Method Problem Build in DEV workspace Fabric Portal UI ✅ Works fine Promote to QA Fabric Deployment Pipeline or manual copy ⚠️ No Git traceability Promote to PROD Fabric Deployment Pipeline with approval ⚠️ Separate governance model from app/infra CI/CD Rollback 🤷 Manual recreation ❌ No deterministic rollback path Audit "Who clicked what, when?" ❌ Limited trail The Core Issue Fabric Deployment Pipelines introduce a parallel governance model that's disconnected from how your platform and application teams already work. You end up with: 🔀 Two different promotion systems (GitHub Actions for apps, Fabric Pipelines for data) 🕳️ Governance blind spots between the two 😰 Cultural friction ("Why do data teams have a different process?") Our Approach: Git as the Single Source of Truth 📖 ┌─────────────┐ push to main ┌─────────────┐ │ Developer │ ──────────────────▶ │ GitHub │ │ commits to │ │ Actions │ │ Git repo │ │ Workflow │ └─────────────┘ └──────┬──────┘ │ ┌─────────────────┼─────────────────┐ ▼ ▼ ▼ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ 🟢 DEV │ │ 🟡 QA │ │ 🔴 PROD │ │ Auto │────▶│ Approval │────▶│ Approval │ │ Deploy │ │ Required │ │ Required │ └──────────┘ └──────────┘ └──────────┘ Every deployment originates from Git. Every promotion is traceable to a commit SHA. Every environment has its own approval gate. One pipeline model — across everything. 🏗️ Solution Architecture 📁 Repository Structure fabric-cicd-project/ │ ├── 📂 .github/ │ ├── 📂 workflows/ │ │ └── 📄 fabric-cicd.yml # GitHub Actions pipeline │ ├── 📄 CODEOWNERS # Review enforcement │ └── 📄 dependabot.yml # Automated dependency updates │ ├── 📂 config/ │ └── 📄 parameter.yml # Environment-specific parameterization │ ├── 📂 deploy/ │ ├── 📄 deploy_workspace.py # Main deployment entrypoint │ └── 📄 validate_repo.py # Pre-deployment validation │ ├── 📂 workspace/ # Fabric items (Git-integrated / PBIP) │ ├── 📄 .env.example # Environment variable template ├── 📄 .gitignore ├── 📄 ruff.toml # Python linting config ├── 📄 requirements.txt # Pinned dependencies ├── 📄 SECURITY.md # Vulnerability disclosure policy └── 📄 README.md 🔧 Key Components Component Purpose fabric-cicd Python library Deploys Fabric items from Git to workspaces (handles all Fabric API calls internally) deploy_workspace.py CLI entrypoint — authenticates, configures, deploys, logs parameter.yml Find-and-replace rules for environment-specific values (connections, lakehouse IDs, etc.) validate_repo.py Pre-flight checks — validates repo structure, parameter.yml presence, .platform files fabric-cicd.yml GitHub Actions workflow — orchestrates validate → DEV → QA → PROD ✨ Feature Deep Dive 1️⃣ Per-Environment Service Principal Isolation 🔐 Instead of a single shared service principal, each environment gets its own: DEV_TENANT_ID / DEV_CLIENT_ID / DEV_CLIENT_SECRET QA_TENANT_ID / QA_CLIENT_ID / QA_CLIENT_SECRET PROD_TENANT_ID / PROD_CLIENT_ID / PROD_CLIENT_SECRET Why this matters: 🛡️ Least-privilege access — the DEV SP can't touch PROD 🔍 Audit clarity — you know which identity deployed where 💥 Blast radius reduction — a compromised DEV secret doesn't affect PROD The deploy script automatically resolves the correct credentials based on TARGET_ENVIRONMENT, with fallback to shared FABRIC_* variables for simpler setups. 2️⃣ Environment-Specific Parameterization 🎛️ A single parameter.yml drives all environment differences: find_replace: - find: "DEV_Lakehouse" replace_with: DEV: "DEV_Lakehouse" QA: "QA_Lakehouse" PROD: "PROD_Lakehouse" - find: "dev-sql-server.database.windows.net" replace_with: DEV: "dev-sql-server.database.windows.net" QA: "qa-sql-server.database.windows.net" PROD: "prod-sql-server.database.windows.net" ✅ Same Git artifacts → different runtime bindings per environment ✅ No manual edits between promotions ✅ Easy to review in pull requests 3️⃣ Approval-Gated Promotions ✅ The GitHub Actions workflow uses GitHub Environments with reviewer requirements: Environment Trigger Approval 🟢 DEV Automatic on push to main None — deploys immediately 🟡 QA After successful DEV deploy ✅ Requires reviewer approval 🔴 PROD After successful QA deploy ✅ Requires reviewer approval Reviewers see a rich job summary in GitHub showing: 📌 Git commit SHA being deployed 🎯 Target workspace and environment 📦 Item types in scope ⏱️ Deployment duration ✅ / ❌ Final status 4️⃣ Pre-Deployment Validation 🔍 Before any deployment runs, a dedicated validate job checks: Check What It Does 📂 workspace exists Ensures Fabric items are present 📄 parameter.yml exists Ensures parameterization is configured 📄 .platform files present Validates Fabric Git integration metadata 🐍 ruff check deploy/ Lints Python code for syntax errors and bad imports If validation fails, no deployment runs — across any environment. 5️⃣ Full Git SHA Traceability 📜 Every deployment logs and surfaces the exact Git commit being deployed: Why this matters: 🔄 Rollback = git revert <sha> + push → pipeline redeploys previous state 🕵️ Audit = every PROD deployment tied to a specific commit, reviewer, and timestamp 🔀 Diff = git diff v1..v2 shows exactly what changed between deployments 6️⃣ Concurrency Control 🚦 concurrency: group: fabric-deploy-${{ github.ref }} cancel-in-progress: false Two rapid pushes to main won't cause parallel deployments fighting over the same workspace. The second run queues until the first completes. 7️⃣ Smart Path Filtering 🧠 paths-ignore: - "**.md" - "docs/**" - ".vscode/**" A README-only commit? A docs update? No deployment triggered. This saves runner minutes and avoids unnecessary approval requests for QA/PROD. 8️⃣ Retry Logic with Exponential Backoff 🔁 The deploy script wraps fabric-cicd calls with retry logic: Attempt 1 → fails (HTTP 429 rate limit) ⏳ Wait 5 seconds Attempt 2 → fails (HTTP 503 transient) ⏳ Wait 15 seconds Attempt 3 → succeeds ✅ Transient Fabric service issues don't break your pipeline — the deployment retries automatically. 9️⃣ Orphan Cleanup 🧹 Set CLEAN_ORPHANS=true and items that exist in the workspace but not in Git get removed: Workspace has: Notebook_A, Notebook_B, Notebook_C Git repo has: Notebook_A, Notebook_B → Notebook_C gets removed (orphan) This ensures your workspace exactly matches your Git state — no drift, no surprises. 🔟 Dependency Management with Dependabot 🤖 # .github/dependabot.yml updates: - package-ecosystem: "pip" schedule: interval: "weekly" - package-ecosystem: "github-actions" schedule: interval: "weekly" fabric-cicd, azure-identity, and GitHub Actions versions are automatically monitored. When updates are available, Dependabot opens a PR — keeping your pipeline secure and current. 1️⃣1️⃣ CODEOWNERS Enforcement 👥 # .github/CODEOWNERS /deploy/ @platform-team /config/ @platform-team /.github/workflows/ @platform-team Changes to deployment scripts, parameterization, or the workflow require review from the platform team. No one accidentally modifies the pipeline without oversight. 1️⃣2️⃣ Job Timeouts ⏱️ Job Timeout Validate 10 minutes Deploy (DEV/QA/PROD) 30 minutes A hung process won't burn 6 hours of runner time. It fails fast, alerts the team, and frees the runner. 1️⃣3️⃣ Security Policy 🛡️ A dedicated SECURITY.md provides: 📧 Responsible vulnerability disclosure process ⏰ 48-hour acknowledgement SLA 📋 Best practices for contributors (no secrets in code, least-privilege SPs, 90-day rotation) 🔄 The Complete Workflow Here's what happens end-to-end when a developer merges a PR: 1. 👨💻 Developer merges PR to main │ 2. 🔍 VALIDATE job runs │ ✅ Repo structure checks │ ✅ Python linting (ruff) │ ✅ parameter.yml validation │ 3. 🟢 DEPLOY-DEV job runs (automatic) │ 🔑 Authenticates with DEV SP │ 📦 Deploys all items to DEV workspace │ 📝 Logs commit SHA + summary │ 4. 🟡 DEPLOY-QA job waits for approval │ 👀 Reviewer checks job summary │ ✅ Reviewer approves │ 🔑 Authenticates with QA SP │ 📦 Deploys all items to QA workspace │ 5. 🔴 DEPLOY-PROD job waits for approval │ 👀 Reviewer checks job summary │ ✅ Reviewer approves │ 🔑 Authenticates with PROD SP │ 📦 Deploys all items to PROD workspace │ 6. 🎉 Done — all environments in sync with Git 🆚 Comparison: This Approach vs. Fabric Deployment Pipelines Capability Fabric Deployment Pipelines This Solution (fabric-cicd + GitHub Actions) Source of truth Workspace ✅ Git Promotion trigger UI click / API call ✅ Git push + approval Approval gates Fabric-native ✅ GitHub Environments (same as app teams) Audit trail Fabric activity log ✅ Git commits + GitHub Actions history Rollback Manual ✅ git revert + auto-redeploy Cross-platform governance Separate model ✅ Unified with infra/app CI/CD Parameterization Deployment rules ✅ parameter.yml (reviewable in PR) Secret management Fabric-managed ✅ GitHub Secrets + per-env SP isolation Drift detection Limited ✅ Orphan cleanup (CLEAN_ORPHANS=true) 🚀 Getting Started Prerequisites 3 Fabric workspaces (DEV, QA, PROD) Service principal(s) with Contributor role on each workspace GitHub repository with Actions enabled GitHub Environments configured (dev, qa, prod) Quick Setup # 1. Clone the repo git clone https://github.com/<your-org>/fabric-cicd-project.git # 2. Install dependencies pip install -r requirements.txt # 3. Copy and fill environment variables cp .env.example .env # 4. Run locally against DEV python deploy/deploy_workspace.py GitHub Actions Setup Create GitHub Environments: dev, qa (add reviewers), prod (add reviewers) Add secrets to each environment: DEV_TENANT_ID, DEV_CLIENT_ID, DEV_CLIENT_SECRET QA_TENANT_ID, QA_CLIENT_ID, QA_CLIENT_SECRET PROD_TENANT_ID, PROD_CLIENT_ID, PROD_CLIENT_SECRET DEV_WORKSPACE_ID, QA_WORKSPACE_ID, PROD_WORKSPACE_ID Push to main — the pipeline takes over! 🎉 💡 Lessons Learned After implementing this pattern across several engagements, here are the key takeaways: ✅ What Works Well Teams love the Git traceability once they experience a clean rollback Approval gates in GitHub feel natural to platform engineers Parameter.yml changes in PRs create great review conversations about environment differences Job summaries give reviewers confidence to approve without digging into logs ⚠️ Watch Out For Cultural resistance is the #1 blocker — invest in enablement, not just automation Fabric items with runtime state (data in lakehouses, refresh history) aren't captured in Git Secret rotation across 3+ environments needs process discipline (consider OIDC federated credentials) Run a "portal vs. pipeline" side-by-side demo early — it changes minds fast 🤝 For CSAs: Sharing This With Customers This solution is ideal for customers who: ☑️ Already use GitHub Actions for application or infrastructure CI/CD ☑️ Have governance requirements that demand Git-based audit trails ☑️ Operate multiple Fabric workspaces across environments ☑️ Want to standardize their promotion model across all workloads ☑️ Are moving from Power BI Premium to Fabric and want to modernize their DevOps practices 🗣️ Conversation Starters "How are you promoting Fabric items between environments today?" "Is your data team using the same CI/CD patterns as your app teams?" "If something goes wrong in production, how quickly can you roll back to the previous version?" 📚 Resources 📦 fabric-cicd on PyPI 📖 fabric-cicd Documentation 🐙 GitHub Actions Documentation 🏗️ Microsoft Fabric Git Integration 🌐Git Repository URL: vinod-soni-microsoft/FABRIC-CICD-PROJECT: Enterprise-grade CI/CD solution for Microsoft Fabric using fabric-cicd Python library and GitHub Actions. Git-driven deployments across DEV → QA → PROD with environment approval gates, per-environment service principal isolation, and parameterized promotion — no Fabric Deployment Pipelines required. 🏁 Conclusion The shift from UI-driven promotion to Git-driven CI/CD for Microsoft Fabric isn't just a technical upgrade — it's a governance and cultural alignment decision. By using fabric-cicd with GitHub Actions, you get: 📖 One source of truth (Git) 🔄 One promotion model (GitHub Actions) ✅ One approval process (GitHub Environments) 🔍 One audit trail (Git history + Actions logs) 🔐 One security model (GitHub Secrets + per-env SPs) No parallel governance. No hidden drift. No "who clicked what in the portal." Just Git, code, and confidence. 💪 Have questions or want to share your experience? Drop a comment below — I'd love to hear how your team is approaching Fabric CI/CD! 👇ArchAngel: Skilling the next developer generation for the Agentic transformation.
AI is transforming the SDLC at speed but there's a quieter question following close behind. If the code is being written for your junior developers, when do they learn the skills to become senior ones and how can they use these tools without losing understanding of what they've made? Most tools try to solve this by catching issues after the commit, but by then the teaching moment is already gone. What if your developers could learn why something is wrong and how to fix it, as they write it? ArchAngel is built around a simple idea: What if your team’s best engineering practices could exist directly inside the IDE and guide developers as they work? It turns your team's collective wisdom into a mentor that scales to every new hire, every sprint, every repo. GitHub Copilot has led to great strides in reducing developer toil. It allows teams to move faster, automate repetitive work, and spend more time on higher-level design decisions that drive real impact. As these capabilities evolve, especially with more agentic workflows, developers now have even more powerful ways to generate and iterate on code. The next challenge for engineering teams isn’t adoption. It’s how to make the most of these capabilities while maintaining strong engineering standards, consistency, and shared understanding across teams. Even with powerful tools in place, teams still face familiar challenges. Standards are often spread across repositories, documentation, and conversations. Best practices evolve over time but aren’t always easy to discover or apply consistently. Reviews can become repetitive, and senior engineers spend a lot of time reinforcing patterns rather than focusing on system design. AI accelerates development, but it doesn’t automatically understand how your team builds software, the architectural decisions, trade-offs, and conventions that make systems consistent over time. Technically correct code doesn't mean its organisationally aligned. __________________________________________________________________ The Solution: ArchAngel is built for and by developers. It provides a developer driven, AI coding assistant that teaches without taking autonomy. Rather than waiting for devs to commit their code to pick up issues, ArchAngel sits beside through the coding process to guide and provide iterative live feedback. By connecting ArchAngel to your project bases, ArchAngel can be grounded in your 'golden repositories'; The agreed best practices, existing approved repos and organisational standards. ArchAngel can also generate Code Style and Wiki Docs for a quick look summary guide to your best practices, prompting further research. Let's define the tech stack behind this and what makes ArchAngel work: Semantic Kernel and Microsoft Agent Framework (MAF) - Semantic Kernel is a lightweight, open-source development kit that lets you easily build AI agents and integrate the latest AI models. MAF is an orchestration framework that is used to create agents and conversation history to create and support iterative design. Served via Foundry. Introduction to Semantic Kernel | Microsoft Learn Microsoft Agent Framework Overview | Microsoft Learn Microsoft Foundry documentation | Microsoft Learn Language Server Protocol (LSP) - LSP allows you to invoke ArchAngel from any IDE. For VSCode, command palette is used to invoke certain functionality for ArchAngel such as document generation. Official page for Language Server Protocol SQLite - Store your search database locally with SQLite. Config - Setup just like your other dev tools. For VScode, simply load your base repos into archangel.json, authenticate with github and run the "Index from config file" in your command palette. __________________________________________________________________ Meet Priya: A Senior Software Engineer Responsibilities: Design & build scalable software — Architect and deliver high-quality, maintainable systems from concept through to production. Drive engineering excellence — Lead code reviews, enforce best practices, and champion CI/CD, testing, and observability standards. Mentor junior engineers — Coach and develop less experienced team members, fostering technical growth and a collaborative culture. Collaborate cross-functionally — Work with Product, Design, and stakeholders to translate business needs into robust technical solutions. Innovate & improve continuously — Stay ahead of industry trends, reduce technical debt, and drive adoption of tools and processes that elevate team productivity. Meet Joe: A Junior Software Engineer Responsibilities: Write and maintain clean code — Develop, test, and debug features under guidance from senior engineers, following established coding standards. Learn the codebase & tech stack — Ramp up quickly on existing systems, tools, and workflows to become a productive contributor. Participate in code reviews — Submit code for review and actively learn from feedback, while reviewing peers' work to build technical judgement. Collaborate with the team — Contribute to sprint ceremonies, ask questions, flag blockers early, and communicate progress clearly. Invest in personal growth — Pursue learning through documentation, certifications, pair programming, and internal knowledge-sharing sessions. A day in the life Priya is responsible for Joe's development and guiding him to progress his career, however, Priya is extremely busy and she is managing two other junior devs. So reviewing each junior's code to identify the architectural antipatterns and documentation decisions that deviate from best practice, takes time away from the senior work she brings most value to. Enter ArchAngel New hire/Junior joins 4 months into the project with no clue what repos are being used as an example? The dev team used ArchAngel to create Code Style and Wiki Docs so the junior can at a glance view a few of the standards from the repos. The dev team also has the 'golden repos' linked in the ArchAngel config file, so now all the junior has to do is clone, index and start coding. The junior devs use ArchAngel in their VSCode IDE and receive feedback and critique that is educational directly in their IDE as they code, before anything is committed for formal review. Repository informed chat, code completions and docs make onboarding, learning and guiding junior devs an asynchronous task. Giving Priya some breathing room and improving the quality of the PRs her juniors produce! The ArchAngel's educational, constructive criticism and repository-backed way of reviewing code can also be an asset to learn the PR process itself! Learning what ArchAngel looks out for each time, helps to teach juniors to look out for it themselves in PRs and how to critique it with possible alternatives. __________________________________________________________________ Example Architecture: Project Repo: Your project repository, where your team syncs your config files. Setup for success at a team level. Cloud Environment: Where ArchAngel's brainpower comes from. Secure, scale, monitor and govern with the Azure platform to align organisationally. User Environment: Where you and your teams code, build and create the software that powers the business. Fit to your dev team and extend to your tools. Golden Repos: Your golden base of knowledge. This is the indexed and cited source of your coding standards, the team's guiding principle knowledge and the codebases that your team is striving to match at a code quality and standards level. Next Steps: Making ArchAngel For You Configure + Customise ArchAngel to suit your development environment! VNETs (Virtual Networks) enable secure networking configurations within a cloud environment. Enabling your resources and any tools you wish to setup to communicate in secure channels. Azure Virtual Network Documentation - Tutorials, quickstarts, API references | Microsoft Learn Microsoft Foundry - Engage Foundry's full toolset to make ArchAngel customisable and compliant for your organisation. Microsoft Foundry documentation | Microsoft Learn Foundry Tools: What are Foundry Tools? - Foundry Tools | Microsoft Learn API Management (AI Gateway) - Use AI Gateway and APIM to secure, scale, monitor and govern your agents and connections to Microsoft Foundry. API Management documentation | Microsoft Learn LSP (Language Server Protocol) - Add custom event and event handlers through the LSP channels across IDEs. Official page for Language Server Protocol Customisable Document Generation - Refine the prompts that are used to create the documents and format as you wish! The base version is a skimmed document with a few snippets as source designed to be simple to extend. Find the github repo here: rohitmadhavk/ArchAngel: An education coding assistant to help junior devs learn best practice375Views2likes0Comments