Approaches to Integrating Azure Databricks with Microsoft Fabric: The Better Together Story!
Azure Databricks and Microsoft Fabric can be combined to create a unified and scalable analytics ecosystem. This document outlines eight distinct integration approaches, each accompanied by step-by-step implementation guidance and key design considerations. These methods are not prescriptive: your cloud architecture team can choose the integration strategy that best aligns with your organization’s governance model, workload requirements, and platform preferences. Whether you prioritize centralized orchestration, direct data access, or seamless reporting, these options let you tailor the solution to your specific needs.
Step by Step Guide to Ontology and Plan for Financial Service
What We Will Build
In this guide, we will construct a complete Fabric IQ solution made up of four parts:
- A Lakehouse that ingests publicly available data, including bank financials, P2P lending statistics, borrower demographics, and licensing information.
- A Semantic Model that defines the analytical layer with proper dimensions, measures, and relationships.
- An Ontology that elevates these tables into business entities such as Bank, P2P Platform, Borrower, and Loan, connected by meaningful relationships and governed by regulatory rules.
- A Planning sheet that enables supervisors to forecast enforcement workloads, allocate examination budgets, and model scenarios based on live data.
Step 1: Preparing the Data Foundation in Fabric Lakehouse
Every Fabric IQ solution begins with data. Before we can model business semantics or build planning sheets, we need a well-structured Lakehouse that holds our source data in a governed and queryable format.
Creating the Lakehouse
Navigate to your Fabric workspace and create a new Lakehouse. In this example, we have named it P2PLendingLH, housed within the workspace P2P Lending CrossSector Demo. The Lakehouse serves as the Bronze and Silver layers of our medallion architecture, storing both raw ingested data and transformed analytical tables.
Data Sources and Tables
The Lakehouse is populated with data from publicly available publications. The table structure follows a dimensional modeling pattern with a clear separation between dimension tables (prefixed with dim_) and relationship tables (prefixed with rel_).
The following tables form the foundation of our model:
- dim_bank: bank profiles, including KBMI tier, total assets, CAR, NPL, and channeling exposure percentage
- dim_borrower: borrower demographics with credit score, employment type, province, and risk segment
- dim_p2p_platform: licensed P2P lending operators with TWP90 rate, outstanding balance, and total borrowers
- dim_loan: individual loan records with amount, tenure, interest rate, and repayment status
- dim_supervisor_team: supervisory teams and their regional assignments
- dim_channeling_agreement: bank-to-P2P channeling contracts and exposure limits
Several relationship tables capture the connections between entities in addition to the dimension tables. These are rel_bank_channels_platform (which bank funds which P2P platform), rel_borrower_takes_loan (linking borrowers to their loans), rel_loan_funded_by_bank (tracing the funding chain), rel_platform_issues_loan (connecting platforms to the loans they originate), and rel_supervisor_oversees_platform and rel_supervisor_oversees_bank (mapping supervisory responsibility).
Step 2: Creating the Semantic Model
With data in the Lakehouse, the next step is to create a Semantic Model that defines the analytical interface. The Semantic Model is a Power BI construct that organizes your tables into a star schema with relationships, hierarchies, and measures. More importantly for our purpose, this Semantic Model will later serve as the blueprint from which we generate our Ontology.
Generating the Model from the Lakehouse
From within the Lakehouse, click "New semantic model" in the toolbar. A dialog appears that lets you name your model and select which tables to include. In our case, we select all dimension and relationship tables so that the Ontology has full visibility into the data landscape.
Figure 1. Creating a new Direct Lake semantic model from the P2PLendingLH Lakehouse, selecting dimension and relationship tables for inclusion.
Notice that the dialog shows the workspace name (P2P Lending CrossSector Demo) and provides a searchable list of all available tables. Direct Lake mode is selected automatically, so the Semantic Model queries data directly from the Lakehouse Parquet files without importing a copy. This matters for our use case. When the regulator publishes updated monthly statistics and the Lakehouse is refreshed, the Semantic Model reflects the latest data, and so does the Ontology built on it.
Configuring Relationships and Properties
After creation, the Semantic Model opens in the editing view. There you can configure relationships, add calculated measures, and define display properties. The model view shows the entity cards with their fields and the lines connecting related tables.
Figure 2. The Semantic Model editor showing entity cards for dim_bank and dim_borrower, with relationship lines and the full table listing in the Data panel.
The screenshot above shows two of the core dimension tables. The dim_bank table contains fields such as bank_id, bank_type, channeling_exposure_pct, channeling_total, name, regulator_team, and total_assets. The dim_borrower table holds borrower_id, credit_score, employment_type, name, province, and risk_segment. The Data panel on the right lists every table in this model, including all the relationship tables that define the connections between entities.
At this stage, verify that all necessary relationships are correctly established. For example, dim_bank should connect to rel_bank_channels_platform through bank_id, and dim_p2p_platform should connect to rel_platform_issues_loan through platform_id. These relationships are what enable the Ontology to reason across domains in the next step.
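To make the join path concrete, here is a minimal plain-Python sketch of the traversal these relationships enable: from dim_bank through rel_bank_channels_platform to dim_p2p_platform. The tables are tiny in-memory stand-ins. The assumptions are that the relationship table carries bank_id and platform_id keys, and that it has a hypothetical amount column; all names and values are made up. In Fabric, the Semantic Model and Ontology do this work, not hand-written code. The same joins also yield bank-level figures such as an exposure-weighted TWP90 and channeling exposure as a percentage of total assets.

```python
# Hypothetical in-memory stand-ins for three Lakehouse tables. Key columns
# (bank_id, platform_id, twp90_rate, total_assets) follow the article; the
# "amount" column on the relationship table and all values are assumptions.
dim_bank = [
    {"bank_id": "B1", "name": "Bank One", "total_assets": 500.0},
    {"bank_id": "B2", "name": "Bank Two", "total_assets": 800.0},
]
dim_p2p_platform = [
    {"platform_id": "P1", "name": "Platform A", "twp90_rate": 7.5},
    {"platform_id": "P2", "name": "Platform B", "twp90_rate": 2.1},
    {"platform_id": "P3", "name": "Platform C", "twp90_rate": 12.0},
]
rel_bank_channels_platform = [
    {"bank_id": "B1", "platform_id": "P1", "amount": 20.0},
    {"bank_id": "B1", "platform_id": "P2", "amount": 30.0},
    {"bank_id": "B2", "platform_id": "P3", "amount": 10.0},
    {"bank_id": "B2", "platform_id": "P2", "amount": 40.0},
]

BANKS = {b["bank_id"]: b for b in dim_bank}
PLATFORMS = {p["platform_id"]: p for p in dim_p2p_platform}


def exposure_to_risky_platforms(threshold=5.0):
    """Total amount each bank channels to platforms whose TWP90 exceeds threshold (percent)."""
    result = {}
    for link in rel_bank_channels_platform:
        # bank_id -> bridge row -> platform_id, mirroring the model relationships.
        if PLATFORMS[link["platform_id"]]["twp90_rate"] > threshold:
            name = BANKS[link["bank_id"]]["name"]
            result[name] = result.get(name, 0.0) + link["amount"]
    return result


def links_for(bank_id):
    """All bridge rows for one bank."""
    return [link for link in rel_bank_channels_platform if link["bank_id"] == bank_id]


def weighted_twp90(bank_id):
    """TWP90 averaged over a bank's platforms, weighted by channeled amount."""
    links = links_for(bank_id)
    total = sum(link["amount"] for link in links)
    if total == 0:
        return 0.0
    weighted = sum(link["amount"] * PLATFORMS[link["platform_id"]]["twp90_rate"] for link in links)
    return weighted / total


def channeling_exposure_pct(bank_id):
    """Total channeled amount as a percentage of the bank's total assets."""
    channeled = sum(link["amount"] for link in links_for(bank_id))
    return 100 * channeled / BANKS[bank_id]["total_assets"]


print(exposure_to_risky_platforms())  # {'Bank One': 20.0, 'Bank Two': 10.0}
print(round(weighted_twp90("B1"), 2), channeling_exposure_pct("B1"))  # 4.26 10.0
```

With these made-up rows, each bank has one funded platform above the 5 percent threshold. Raising the threshold to 10 leaves only Bank Two, whose exposure to Platform C (12.0 percent) still qualifies.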
You may also want to add calculated measures at this point. Examples include a weighted average TWP90 across all platforms funded by a specific bank, or total channeling exposure as a percentage of the bank's total assets. These measures are carried forward into the Ontology, where AI agents can use them for natural language querying.
Step 3: Generating the Ontology
This is the step where the magic of Fabric IQ truly comes alive. The Ontology turns your Semantic Model from a reporting layer into an intelligence layer. The Semantic Model answers the question "what does the data look like," while the Ontology answers the question "what does the data mean."
What the Ontology Does
An Ontology in Fabric IQ is a machine-understandable vocabulary of your business. It consists of three parts:
- Entity types: the things in your environment, such as Bank, Borrower, or P2P Platform.
- Properties: the facts about those entities, such as a bank's NPL ratio or a platform's TWP90 rate.
- Relationships: the ways entities connect, such as a Bank channeling funding to a P2P Platform.
Beyond static modeling, the Ontology also supports rules and constraints that can trigger automated actions when business conditions are met.
Generating from the Semantic Model
To create the Ontology, open your Semantic Model and select the "Generate Ontology" button in the toolbar. This opens the generation dialog, which presents three key value propositions:
- Unify models into a semantic layer: align concepts across domains and modeling paradigms, bringing banking data and P2P lending data into a shared vocabulary.
- Model expressively: capture complex relationships, domain-specific rules, and actions that drive business workflows, such as triggering an alert when a P2P platform's TWP90 crosses the 5 percent regulatory threshold.
- Reason over events and temporal patterns: use sequences and trends to inform decisions and automation, such as detecting three consecutive months of TWP90 deterioration.
Figure 3. The Ontology generation dialog, creating a new Ontology named NewP2P from the existing Semantic Model within the P2P Lending CrossSector Demo workspace.
In the dialog, you specify the workspace (P2P Lending CrossSector Demo) and give your Ontology a name (in this example, NewP2P). After you click Create, Fabric IQ analyzes the Semantic Model's structure and identifies entity types from dimension tables. It then infers relationships from the foreign key connections and generates a navigable graph that represents your business domain.
Enriching the Ontology with Rules
Once the Ontology is generated, you can enrich it with business rules that reflect regulatory requirements. The following rules are particularly relevant for the P2P lending use case:
- Elevated TWP90. Condition: a P2P platform's TWP90 exceeds 5 percent. Action: flag the platform as high risk and alert the PVML supervisor.
- Contagion Risk. Condition: a bank's channeling exposure to a flagged P2P platform exceeds 10 percent of its portfolio. Action: alert the Banking supervisor and recommend a joint examination.
- Youth Overleveraged. Condition: borrowers aged 19 to 34 represent more than 60 percent of a platform's portfolio AND the platform's TWP90 is above average. Action: trigger a consumer protection review and an education program allocation.
- CAR Threshold. Condition: a bank's CAR drops below 10 percent while it has active P2P channeling agreements. Action: escalate to the Kepala Eksekutif Pengawas Perbankan (Chief Executive of Banking Supervision).
These rules integrate with Fabric Activator, which lets the Ontology initiate business processes automatically through alerts and automated actions. When new monthly P2P statistics are ingested and a platform's TWP90 crosses the threshold, the system does not wait for an analyst to discover it manually. The rule fires, the alert is sent, and the supervisory workflow begins.
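The rule conditions themselves are simple; Fabric Activator's contribution is evaluating them continuously against fresh data. As a minimal sketch under stated assumptions, the four rules and the three-month deterioration pattern can be written out in plain Python. This is not how Activator rules are authored. It only spells out the conditions. The assumptions are:
- The field names youth_share, portfolio, exposures, and twp90_history are hypothetical.
- "Above average" means the mean TWP90 across the platforms being evaluated.
- Any positive channeled exposure counts as an active agreement.
- "Three consecutive months of deterioration" means three month-over-month increases.

```python
from dataclasses import dataclass, field

TWP90_LIMIT = 5.0        # percent (Elevated TWP90)
CONTAGION_SHARE = 0.10   # exposure / portfolio (Contagion Risk)
YOUTH_SHARE = 0.60       # share of borrowers aged 19-34 (Youth Overleveraged)
CAR_FLOOR = 10.0         # percent (CAR Threshold)


@dataclass
class Platform:
    name: str
    twp90_rate: float                                          # percent
    youth_share: float                                         # hypothetical: fraction aged 19-34
    twp90_history: list[float] = field(default_factory=list)  # monthly, oldest first


@dataclass
class Bank:
    name: str
    car: float                                                 # capital adequacy ratio, percent
    portfolio: float                                           # hypothetical: total portfolio size
    exposures: dict[str, float] = field(default_factory=dict)  # platform name -> amount


def deteriorating(history, months=3):
    """True if the last `months` month-over-month changes are all increases (assumption)."""
    recent = history[-(months + 1):]
    return len(recent) == months + 1 and all(b > a for a, b in zip(recent, recent[1:]))


def evaluate(platforms, banks):
    """Return (rule, subject) pairs for every rule that fires."""
    alerts = []
    avg_twp90 = sum(p.twp90_rate for p in platforms) / len(platforms)
    flagged = set()
    for p in platforms:
        if p.twp90_rate > TWP90_LIMIT:
            flagged.add(p.name)
            alerts.append(("Elevated TWP90", p.name))
        if p.youth_share > YOUTH_SHARE and p.twp90_rate > avg_twp90:
            alerts.append(("Youth Overleveraged", p.name))
        if deteriorating(p.twp90_history):
            alerts.append(("TWP90 deterioration", p.name))
    for b in banks:
        # Contagion Risk: exposure to any single flagged platform above 10% of portfolio.
        if b.portfolio > 0 and any(
            name in flagged and amount / b.portfolio > CONTAGION_SHARE
            for name, amount in b.exposures.items()
        ):
            alerts.append(("Contagion Risk", b.name))
        # CAR Threshold: any positive exposure is treated as an active agreement.
        if b.car < CAR_FLOOR and any(amount > 0 for amount in b.exposures.values()):
            alerts.append(("CAR Threshold", b.name))
    return alerts


platforms = [
    Platform("Platform A", 7.5, 0.70, [4.0, 5.0, 6.0, 7.5]),
    Platform("Platform B", 2.1, 0.65, [2.5, 2.3, 2.2, 2.1]),
]
banks = [
    Bank("Bank One", 12.0, 100.0, {"Platform A": 15.0, "Platform B": 5.0}),
    Bank("Bank Two", 9.0, 200.0, {"Platform B": 30.0}),
]
for rule, subject in evaluate(platforms, banks):
    print(f"{rule}: {subject}")
```

In this toy data, Bank One trips Contagion Risk only because Platform A was flagged first, which mirrors how the Contagion Risk rule depends on the Elevated TWP90 flag.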
Querying with Natural Language
One of the most powerful capabilities the Ontology enables is querying across domains in natural language through a Data Agent. Because the Ontology defines the business vocabulary and binds it to real data, a supervisor can ask questions like: "Which banks have channeling agreements with P2P platforms whose TWP90 is currently above 5 percent, and what is their total exposure?" The Data Agent resolves this query by traversing the Ontology graph. It starts at the Bank entity, follows the channels_funding_to relationship to P2P Platform, filters by the TWP90 property, and aggregates the channeling_total measure.
Step 4: Setting Up Planning Sheets
The Ontology tells you what is happening in your business right now, and the Plan item in Fabric IQ helps you decide what should happen next. Planning in Fabric IQ brings budgeting, forecasting, and scenario modeling into the same environment where your data lives. This closes the gap between analytical insights and forward-looking decisions.
Creating a Planning Sheet
To create a Plan, navigate to your workspace and select New Item, then Plan (preview). After naming the plan and connecting it to your Semantic Model, you can build Planning sheets that pull dimensions and measures directly from the same data that powers your Ontology. In the screenshot below, a Planning sheet named "Planning P2P" presents a tabular view of all P2P lending platforms alongside their key risk metrics.
Figure 4. The Planning sheet showing P2P lending platforms with their TWP90 rates, total outstanding balances (in trillions of Rupiah), total borrower counts (in thousands), and risk categories.
The Planning sheet uses the platform name and risk_category as row dimensions, with three critical measures as values:
- Sum of twp90_rate
- Sum of total_outstanding (displayed in trillions of Rupiah)
- Sum of total_borrowers (displayed in thousands)
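Because the sheet uses Sum of twp90_rate, any aggregate row adds rates together. A minimal sketch with hypothetical values shows how that sum differs from a simple mean and from an outstanding-weighted average. The sketch assumes TWP90 is expressed as a percentage of each platform's outstanding balance. Under that assumption, weighting by total_outstanding approximates an industry-level rate.

```python
# Hypothetical rows: (platform, twp90_rate in percent, total_outstanding in trillions of Rupiah)
rows = [
    ("Platform A", 18.0, 0.5),
    ("Platform B", 2.0, 4.5),
    ("Platform C", 4.0, 5.0),
]


def aggregates(rows):
    """Return (sum of rates, simple mean, outstanding-weighted mean)."""
    rate_sum = sum(rate for _, rate, _ in rows)     # what "Sum of twp90_rate" reports
    simple_mean = rate_sum / len(rows)              # every platform counts equally
    outstanding = sum(out for _, _, out in rows)
    # Weight each platform's rate by its outstanding balance.
    weighted = sum(rate * out for _, rate, out in rows) / outstanding
    return rate_sum, simple_mean, weighted


rate_sum, simple_mean, weighted = aggregates(rows)
print(f"sum={rate_sum:.2f} mean={simple_mean:.2f} weighted={weighted:.2f}")
# sum=24.00 mean=8.00 weighted=3.80
```

The sum (24.00) is not a rate at all. The simple mean overstates risk when the worst platforms are small. If you want a single industry-level figure, a weighted measure along these lines could be added to the Semantic Model.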
The risk_category column gives an immediate visual classification of each platform's health. Categories such as Elevated and Very High clearly indicate where supervisory attention should be directed.
Several insights emerge immediately from the data:
- DanaBijak and DanaCepat both carry a Very High risk category, with TWP90 rates of 18.77 and 17.79 respectively.
- CashWagon ID carries an Elevated designation with a comparatively modest TWP90 of 8.26. That rate is still above the 5 percent threshold, and the platform has a sizable borrower base of 144.97 thousand borrowers.
- The aggregate row at the top shows the industry totals: a combined TWP90 of 365.38, total outstanding of 29.86 trillion Rupiah, and 7,281.55 thousand borrowers across the monitored universe. The 365.38 figure is a simple sum of per-platform rates, so it is not itself a meaningful industry default rate.
Using Planning for Supervisory Resource Allocation
The real power of the Planning sheet becomes apparent when supervisors use it for forward-looking decisions. Consider the following scenarios that can be modeled directly within the Planning interface:
- Enforcement Forecasting: Multiple platforms currently sit in the Very High risk category. Based on that, supervisors can forecast the expected volume of warning letters and administrative sanctions for the coming quarter. Suppose historical patterns show that each Very High platform typically receives two to three rounds of correspondence before resolution. The planning sheet can then project staffing requirements for the enforcement team.
- Budget Allocation: The Planning sheet can incorporate budget dimensions alongside risk metrics. If the current quarterly examination budget allows for on-site visits to 15 platforms, the risk category column helps prioritize which platforms to visit first. The forecast capability can then project whether the budget is sufficient given the current risk trajectory, or whether a reallocation request should be submitted.
Microsoft Fabric Operations Agent Step by Step Walkthrough
Fabric Capacity and Workspace
You need a Microsoft Fabric workspace backed by a paid capacity. Trial capacities are not supported for Operations Agent. Your capacity must also be provisioned in a supported region. As of April 2026, Operations Agent is available in all Microsoft Fabric regions except South Central US and East US. If your capacity is outside the US or EU, you also need to enable cross-geo processing and storage for AI in the tenant settings.
Your workspace must contain an Eventhouse with at least one KQL database. The Eventhouse is the telemetry backbone, and the KQL database holds the tables the agent will monitor. The screenshot below shows the environment used throughout this guide. It is a workspace named OperationAgent-WS containing:
- an Eventhouse (ops_eventhouse)
- two KQL databases (ops_db and ops_eventhouse)
- a Lakehouse (ops_lakehouse)
Figure 1. Workspace contents showing the Eventhouse, KQL databases, and Lakehouse ready for the Operations Agent.
Enabling the Operations Agent in the Admin Portal
A Fabric administrator must enable the Operations Agent preview toggle in the Admin Portal before anyone in the organization can create an agent. Navigate to the Admin Portal, locate the Real-Time Intelligence section, and find the setting labeled Enable Operations Agents (Preview). Set it to Enabled for the entire organization or for specific security groups, depending on your governance requirements.
Also ensure that Microsoft Copilot and Azure OpenAI Service are enabled at the tenant level. The Operations Agent relies on Azure OpenAI to generate its playbook and to reason about data when conditions are met.
Figure 2. The Admin Portal showing the Enable Operations Agents (Preview) toggle set to Enabled for the entire organization.
Note that messages sent to Operations Agents are processed through the Azure AI Bot Service.
If your capacity is outside the EU Data Boundary, data may be processed outside your geographic or national cloud boundary. Communicate this to your compliance stakeholders before enabling the feature in production tenants.
Microsoft Teams Account
Every person who will receive recommendations from the agent must have a Microsoft Teams account. The Operations Agent delivers its findings and action suggestions through a dedicated Teams app called Fabric Operations Agent. You can install it from the Teams app store by searching for its name. Once it is installed, the agent can send messages containing data summaries and recommended actions directly to the designated recipients.
Creating and Configuring the Operations Agent
With your prerequisites in place, you are ready to create the Operations Agent. The following steps walk you through the entire configuration process in the Fabric portal.
Step 1: Create a New Operations Agent
Open the Microsoft Fabric portal and navigate to your workspace. On the Fabric home page, select the ellipsis icon and then select Create. In the Create pane, scroll to the Real-Time Intelligence section and select Operations Agent. A dialog asks you to name your agent and select the target workspace. Choose a descriptive name that reflects the agent’s purpose. In this guide, the agent is named OperationsAgent_1 and is deployed to the OperationAgent-WS workspace.
Step 2: Define Business Goals and Agent Instructions
Once the agent is created, you are taken to the Agent Setup page, which is divided into two halves. You configure the agent’s behavior on the left side. The right side shows the generated Agent Playbook after you save.
The first field is Business Goals, where you describe the high-level objective the agent should accomplish. Write this in clear, outcome-oriented language.
In this demo, the business goal is set to: “Monitor data pipeline execution and alert on failures.”
The second field is Agent Instructions, where you provide more specific guidance on how the agent should reason about the data. Think of this as a brief you would hand to an analyst who will be watching your systems overnight. Be explicit about the table name, the column to watch, and the condition that constitutes an alert. In this demo, the instruction reads: “Monitor pipeline_runs table. Alert when status is failed.”
Together, the business goals and instructions give the underlying large language model enough context to generate an accurate playbook. The more specific your instructions, the more reliable the agent’s behavior will be.
Figure 3. The Agent Setup page showing business goals, agent instructions, and the generated playbook on the right.
The right side of the screen shows the Agent Playbook generated after saving. The playbook includes a Business Term Glossary, which shows the business objects the agent inferred from your goals and data. In this case, it identified an object called PipelineRun, mapped to the pipeline_runs table, with two properties:
- status: the pipeline run status, from the status column
- runId: the unique identifier, from the run_id column
It also displays the Rules section, which contains the conditions the agent will evaluate.
Review the playbook carefully. Because it is generated by an AI model, it may occasionally misinterpret your intent. Verify that every property maps to the correct column and that the rules reflect your intended thresholds. If something is off, update your goals or instructions and save again to regenerate the playbook.
Step 3: Add a Knowledge Source
Scroll down on the Agent Setup page to find the Knowledge section. This is where you connect the agent to the data it will monitor. When you first open this section, it displays a message indicating that no knowledge source has been added yet.
Figure 4. The Knowledge section before any data source has been added.
Select the Add Data button to browse the available data sources. A panel lists the KQL databases and Eventhouses accessible within your Fabric environment. In this demo, three sources are available:
- ops_db in the OperationAgent-WS workspace
- wms_eventhouse in the WMS-CDC-Demo workspace
- ops_eventhouse in the OperationAgent-WS workspace
Select the database that contains the table you want the agent to monitor. For this guide, select ops_db, which holds the pipeline_runs table referenced in the agent instructions.
Figure 5. Selecting the knowledge source from available KQL databases and Eventhouses.
Once the knowledge source is connected, the agent queries this database at regular intervals (approximately every five minutes) to evaluate its rules. Make sure the table in your selected database is actively receiving data, especially if you plan to demonstrate the agent detecting a condition in real time.
Step 4: Define Actions
Actions are the responses the agent can recommend when it detects a condition that matches its rules. Scroll further down the Agent Setup page to find the Actions section, then select the Add Action button to define a new custom action.
A dialog titled New Custom Action appears with three fields:
- Action Name: a short, descriptive label for the action.
- Action Description: explains the purpose of the action and gives the agent context about when to use it.
- Parameters: input fields that pass dynamic values (such as names, dates, or identifiers) into the Power Automate flow that will be triggered.
Figure 6. The New Custom Action dialog where you define the action name, description, and optional parameters.
In this demo, the action is named Send Email Alert, with a description indicating that it should send an email notification when a pipeline failure is detected.
Once created, the action appears in the Actions section with a green status indicator showing that it is successfully connected.
Figure 7. The Actions section showing the Send Email Alert action with a connected status.
Step 5: Configure the Custom Action with Power Automate
After creating the action, configure it by linking it to an Activator item and a Power Automate flow. Select the action you just created to open the Configure Custom Action pane, which contains several fields.
First, select the Workspace where the Activator item resides. In this demo, the workspace is OperationAgent-WS. Next, select the Activator, which is the Fabric item that bridges the Operations Agent and Power Automate. Here, the Activator is named Email_Alert_Activator.
Once the connection is created, a Connection String is generated. This string is a unique identifier that links the Operations Agent to the Power Automate flow. Select the Copy button to copy it to your clipboard, because you will need it in the next step.
Below the connection string is the Open Flow Builder button. Select it to launch the Power Automate flow designer, where you will build the email notification flow.
Figure 8. The Configure Custom Action pane showing the workspace, Activator, connection string, and the button to open the flow builder.
Step 6: Build the Power Automate Flow
When you select Open Flow Builder, a new browser tab opens with the Power Automate designer. The flow comes preconfigured with a trigger called When an Activator Rule is Triggered. This trigger fires whenever an action recommended by the Operations Agent is approved.
The trigger's Parameters tab contains a field labeled Connection String. Paste the connection string you copied in the previous step into this field. This is the critical link that connects the Power Automate flow back to your Operations Agent.
If this string is incorrect or missing, the flow will not fire when the action is approved.
Figure 9. The Power Automate flow builder with the Activator trigger and the Connection String field.
Below the trigger, you can add any actions your workflow requires. For an email alert scenario, add an Office 365 Outlook action to send an email to the operations team. Dynamic content from the trigger can fill in details such as the pipeline run ID, the failure status, and any parameters passed through from the Operations Agent.
Save the flow and return to the Fabric portal. Your action is now fully configured and ready to be triggered by the agent.
Step 7: Generate the Playbook and Start the Agent
With all configuration complete (business goals, instructions, knowledge source, and actions), select Save on the Agent Setup page. Fabric then uses the underlying large language model to generate the agent’s playbook. The playbook is a structured summary of everything the agent knows: its goals, the properties it monitors, and the rules it evaluates. To regenerate the playbook after making changes, select Generate Playbook at the top of the page.
Review the playbook one final time. Confirm that properties map correctly to your table columns and that rules reflect the exact conditions you want to monitor. When you are satisfied, select Start in the toolbar at the top of the page.
The agent then begins actively monitoring your data. It queries the knowledge source approximately every five minutes, evaluating the playbook rules against the latest data. When a condition is met, the agent uses the LLM to summarize the data and generate a recommendation. It then sends a message to the designated recipients through Microsoft Teams.
To pause the agent at any time, select Stop. This is useful during demos when you want to control the timing of the demonstration.
How the Agent Operates at Runtime
Once started, the Operations Agent follows a continuous loop.
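The loop described below can be approximated with a small simulation. This is a minimal sketch, not the agent's implementation.
- From this walkthrough: the five-minute interval, the three-day expiry, and the demo rule ("status is failed").
- Hypothetical: the de-duplication by run_id, the Recommendation class, and the run_action callback, which stands in for the Power Automate flow. The walkthrough does not say how the real agent avoids repeat alerts.
- Omitted: the LLM summarization and Teams messaging steps.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

POLL_INTERVAL = timedelta(minutes=5)   # approximate polling cadence from the walkthrough
APPROVAL_TIMEOUT = timedelta(days=3)   # unanswered recommendations are canceled after this


@dataclass
class Recommendation:
    run_id: str
    created: datetime
    state: str = "pending"             # pending -> approved | rejected | canceled


def expire(pending, now):
    """Cancel recommendations that nobody answered within the timeout."""
    for rec in pending:
        if rec.state == "pending" and now - rec.created > APPROVAL_TIMEOUT:
            rec.state = "canceled"


def poll(rows, now, seen, pending):
    """One polling cycle: expire stale items, then recommend an action per new failed run."""
    expire(pending, now)
    for row in rows:
        # Demo rule from the agent instructions: alert when status is failed.
        if row["status"] == "failed" and row["run_id"] not in seen:
            seen.add(row["run_id"])    # hypothetical de-duplication
            pending.append(Recommendation(row["run_id"], now))


def respond(rec, approved, now, run_action):
    """Apply a Yes/No answer; expired or already-answered items are not actionable."""
    expire([rec], now)
    if rec.state != "pending":
        return rec.state
    rec.state = "approved" if approved else "rejected"
    if approved:
        run_action(rec.run_id)         # stands in for triggering the Power Automate flow
    return rec.state


t0 = datetime(2026, 4, 1, 9, 0)
rows = [{"run_id": "r1", "status": "succeeded"}, {"run_id": "r2", "status": "failed"}]
seen, pending, sent = set(), [], []
poll(rows, t0, seen, pending)
poll(rows, t0 + POLL_INTERVAL, seen, pending)   # same failure is not re-recommended
print(respond(pending[0], True, t0 + timedelta(hours=1), sent.append), sent)  # approved ['r2']
```

Here the approval window is enforced when someone responds or at the next poll. The real service cancels unanswered recommendations on its own schedule.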
Every five minutes, it queries the connected KQL database to evaluate the rules defined in the playbook. If no conditions are met, it continues silently. If a condition is matched (for example, a pipeline run with a status of "failed" appears in the pipeline_runs table), the agent proceeds through the following sequence.
First, the agent uses the large language model to analyze the data that triggered the condition. It summarizes the context, identifies the relevant business object (such as a specific pipeline run), and determines which action to recommend.
Second, the agent sends a message to the designated recipients through Microsoft Teams. The message contains a summary of the detected insight, the data context that triggered it, and a suggested action. Recipients approve the action by selecting Yes or reject it by selecting No. If parameters are included (such as a run ID or a severity level), recipients can review and adjust them before final approval.
Third, if the recipient approves the action, the agent executes it on behalf of the creator, using the creator’s credentials. In this demo, approving the action triggers the Power Automate flow that sends an email alert.
Note that if a recommendation is not responded to within three days, the operation is automatically canceled. After cancellation, the action can no longer be approved or interacted with.
Legacy SSRS reports after upgrading Azure DevOps Server 2020 to 2022 or 25H2
We are currently planning an upgrade from Azure DevOps Server 2020 to Azure DevOps Server 2022 or 25H2, and one of our biggest concerns is reporting. We understand that Microsoft’s recommended direction is to move to Power BI based on Analytics / OData. However, for on-prem environments with a large number of existing SSRS reports, rebuilding everything from scratch would require significant time and effort. Since the Warehouse and Analysis Services are no longer available in newer versions, we would like to understand how other on-prem teams are handling legacy SSRS reporting during and after the upgrade. Have you rebuilt your reports in Power BI, moved to another reporting approach, or found a practical way to keep existing SSRS reports available during the transition? Any real-world experience, lessons learned, or recommended approaches would be greatly appreciated.
How Should a Fresher Learn Microsoft Sentinel Properly?
Hello everyone, I am a fresher interested in learning Microsoft Sentinel and preparing for SOC roles. Sentinel is a cloud-native enterprise tool that is usually used inside organizations. Because of that, I am unsure how individuals without company access are expected to gain real hands-on experience. I would like to hear from professionals who actively use Sentinel:
- How do freshers typically learn and practice Sentinel?
- What learning resources or environments are commonly used by beginners?
- What level of hands-on experience is realistically expected at entry level?
I am looking for guidance based on real industry practice. Thank you for your time.
Missing details in Azure Activity Logs – MICROSOFT.SECURITYINSIGHTS/ENTITIES/ACTION
The Azure Activity Logs are crucial for tracking access and actions within Sentinel. However, I’m encountering a significant lack of documentation and clarity regarding some specific operation types.
Resources consulted:
- https://learn.microsoft.com/en-us/azure/sentinel/audit-sentinel-data
- https://learn.microsoft.com/en-us/rest/api/securityinsights/entities?view=rest-securityinsights-2024-01-01-preview
- https://learn.microsoft.com/en-us/rest/api/securityinsights/operations/list?view=rest-securityinsights-2024-09-01&tabs=HTTP
My issue: I observed unauthorized activity on our Sentinel workspace. The Azure Activity Logs clearly indicate the user involved, the resource, and the operation type: "MICROSOFT.SECURITYINSIGHTS/ENTITIES/ACTION". But that’s it. There is no detail about what the action was, what entity it targeted, or how it was triggered, which makes auditing extremely difficult. It's clear the person was in Sentinel and performed an activity through it (search, KQL, or logs) to find an entity from a KQL query, but that's all. Strangely, this operation is not even listed in the official Sentinel Operations documentation linked above.
My question: Has anyone encountered this and found a way to interpret this operation type properly? Any insight into how to retrieve more meaningful details (action context, target entity, etc.) from these events would be greatly appreciated.
Accelerate Agent Development: Hacks for Building with Microsoft Sentinel data lake
As a Senior Product Manager | Developer Architect on the App Assure team, I work to bring Microsoft Sentinel and Security Copilot solutions to market. In that role I interact with many ISVs building agents on Microsoft Sentinel data lake for the first time. I’ve written this article to walk you through one possible approach to agent development: the process I use when building sample agents internally at Microsoft. If you have questions about this or other methods for building your agent, App Assure offers guidance through our Sentinel Advisory Service.
Throughout this post, I include screenshots and examples from Gigamon’s Security Posture Insight Agent.
This article assumes you have:
- An existing SaaS or security product with accessible telemetry.
- A small ISV team (2–3 engineers + 1 PM).
- A focus on a single high-value scenario for the first agent.
The Composite Application Model (What You Are Building)
When I begin designing an agent, I think end to end, from data ingestion requirements through agentic logic, following the Composite Application Model. The Composite Application Model consists of five layers:
- Data Sources: your product’s raw security, audit, or operational data.
- Ingestion: getting that data into Microsoft Sentinel.
- Sentinel data lake & Microsoft Graph: normalization, storage, and correlation.
- Agent: reasoning logic that queries data and produces outcomes.
- End User: Security Copilot or SaaS experiences that invoke the agent.
This separation lets you evolve data ingestion and agent logic in parallel. It also helps you avoid downstream surprises that force you to go back and rearchitect the entire solution.
Optional Prerequisite
You are enrolled in the ISV Success Program, so you can earn Azure Credits to provision Security Compute Units (SCUs) for Security Copilot Agents.
Phase 1: Data Ingestion Design & Implementation

Choose Your Ingestion Strategy
The first choice I face when designing an agent is how the data is going to flow into my Sentinel workspace. Below I document two primary methods for ingestion.

Option A: Codeless Connector Framework (CCF)
This is the best option for ISVs with REST APIs. To build a CCF solution, reference our documentation for getting started.

Option B: CCF Push (Public Preview)
Here, the ISV pushes events directly to Sentinel via a CCF Push connector. Our MS Learn documentation is a great place to get started with this method.

Additional Note: If you find that CCF does not support your needs, reach out to App Assure so we can capture your requirements for future consideration. Azure Functions remains an option if you've documented your CCF feature needs.

Phase 2: Onboard to Microsoft Sentinel data lake
Once my data is flowing into Sentinel, I onboard a single Sentinel workspace to data lake. This is a one-time action and cannot be repeated for additional workspaces.

Onboarding Steps
1. Go to the Defender portal.
2. Follow the Sentinel data lake onboarding instructions.
3. Validate that tables are visible in the lake. See Running KQL Queries in data lake for additional information.

Phase 3: Build and Test the Agent in Microsoft Foundry
Once my data is successfully ingested into data lake, I begin the agent development process. There are multiple ways to build agents depending on your needs and tooling preferences. For this example, I chose Microsoft Foundry because it fit my needs for real-time logging, cost efficiency, and greater control.

1. Create a Microsoft Foundry Instance
Foundry serves as your development environment. Reference our QuickStart guide for setting up your Foundry instance.
Required Permissions:
- Security Reader (Entra or Subscription)
- Azure AI Developer at the resource group
After setup, click Create Agent.

2. Design the Agent
A strong first agent:
- Solves one narrow security problem.
- Has deterministic outputs.
- Uses explicit instructions, not vague prompts.
Example agent responsibilities:
- Query Sentinel data lake (Sentinel data exploration tool).
- Summarize recent incidents.
- Correlate ISV-specific signals with Sentinel alerts and other ISV tables (Sentinel data exploration tool).

3. Implement Agent Instructions
Well-designed agent instructions should include:
- A role definition ("You are a security investigation agent…").
- The data sources it can access.
- Step-by-step reasoning rules.
- Output format expectations.
Sample instructions can be found here: Agent Instructions

4. Configure the Microsoft Model Context Protocol (MCP) tooling for your agent
For your agent to query, summarize, and correlate all the data your connector has sent to data lake, select Tools, type Sentinel under Catalog, and then select Microsoft Sentinel Data Exploration. For more information about the data exploration tool collection in MCP server, see our documentation.
I always test repeatedly with real data until outputs are consistent. For more information on testing and validating the agent, please reference our documentation.

Phase 4: Migrate the Agent to Security Copilot
Once the agent works in Foundry, I migrate it to Security Copilot. To do this:
- Copy the full instruction set from Foundry.
- Provision an SCU for your Security Copilot workspace. For instructions, please reference this documentation. Keep track of this step: you are charged per hour per SCU, so once you are done testing you will need to deprovision the capacity to prevent additional charges.
- Open Security Copilot and use the Create From Scratch Agent Builder as outlined here.
- Add the Sentinel data exploration MCP tools (the same tools you configured for the Foundry agent in the previous phase). For more information on linking the Sentinel MCP tools, please refer to this article.
- Paste and adapt the instructions.

At this stage, I always validate the following:
- Agent Permissions: I have confirmed the agent has the necessary permissions to interact with the MCP tool and read data from your data lake instance.
- Agent Performance: I have confirmed a successful interaction with measured latency and benchmark results.

This step intentionally avoids reimplementation. I am reusing proven logic.

Phase 5: Execute, Validate, and Publish
After setting up my agent, I navigate to the Agents tab to manually trigger the agent. For more information on testing an agent, you can refer to this article.

Now that the agent has executed successfully, I download the agent manifest file from the environment so that it can be packaged: click View code on the agent under the Build tab, as outlined in this documentation.

Publishing to the Microsoft Security Store
If I were publishing my agent to the Microsoft Security Store, these are the steps I would follow:
1. Finalize ingestion reliability.
2. Document required permissions.
3. Define supported scenarios clearly.
4. Package agent instructions and guidance (by following these instructions).

Summary
Based on my experience developing Security Copilot agents on Microsoft Sentinel data lake, this playbook provides a practical, repeatable framework for ISVs to accelerate agent development and delivery while maintaining high quality standards. This foundation enables rapid iteration: future agents can often be built in days, not weeks, by reusing the same ingestion and data lake setup.

When starting on your own agent development journey, keep the following in mind:
- Limit the initial scope.
- Reuse Microsoft-managed infrastructure.
- Separate ingestion from intelligence.

What Success Looks Like
At the end of this development process, you will have the following:
- A Microsoft Sentinel data connector live in Content Hub (or in process) that provides a data ingestion path.
- Data visible in data lake.
- A tested agent running in Security Copilot.
- Clear documentation for customers.

A key success factor I look for is clarity over completeness. A focused agent is far more likely to be adopted.

Need help?
If you run into any issues while developing your agent, please reach out to the App Assure team for support via our Sentinel Advisory Service. And if you have any other tips, please comment below; I'd love to hear your feedback.

Building Multi-Agent Orchestration Using Microsoft Semantic Kernel: A Complete Step-by-Step Guide
What You Will Build
By the end of this guide, you will have a working multi-agent system where 4 specialist AI agents collaborate to diagnose production issues:
- ClientAnalyst: analyzes browser, JavaScript, CORS, uploads, and UI symptoms
- NetworkAnalyst: analyzes DNS, TCP/IP, TLS, load balancers, and firewalls
- ServerAnalyst: analyzes backend logs, database, deployments, and resource limits
- Coordinator: synthesizes all findings into a root cause report with a prioritized action plan

These agents don't just run in sequence. They debate, cross-examine, and challenge each other's findings through a shared conversation, producing a diagnosis that's better than any single agent could achieve alone.

Table of Contents
1. Why Multi-Agent? The Problem with Single Agents
2. Architecture Overview
3. Understanding the Key SK Components
4. The Actor Model: How InProcessRuntime Works
5. Setting Up Your Development Environment
6. Step-by-Step: Building the Multi-Agent Analyzer
7. The Agent Interaction Flow: Round by Round
8. Bugs I Found & Fixed: Lessons Learned
9. Running with Different AI Providers
10. What to Build Next

1. Why Multi-Agent? The Problem with Single Agents
A single AI agent analyzing a production issue is like having one doctor diagnose everything: they'll catch issues in their specialty but miss cross-domain connections. Consider this problem:

"Users report 504 Gateway Timeout errors when uploading files larger than 10MB. Started after Friday's deployment. Worse during peak hours."

A single agent might say "it's a server timeout" and stop.
But the real root cause often spans multiple layers:
- The client is sending chunked uploads with an incorrect Content-Length header (client-side bug).
- The load balancer has a 30-second timeout that's too short for large uploads (network config).
- The server recently deployed a new request body parser that's 3x slower (server-side regression).
- The combination only fails during peak hours because connection pool saturation amplifies the latency.

No single perspective catches this. You need specialists who analyze independently, then debate to find the cross-layer causal chain. That's what multi-agent orchestration gives you.

The 5 Orchestration Patterns in SK
Semantic Kernel provides 5 built-in patterns for agent collaboration:

SEQUENTIAL:   A → B → C → Done                        (pipeline: each builds on the previous)

CONCURRENT:         ↗ A ↘
              Task  → B →  Aggregate                  (parallel: results merged)
                    ↘ C ↗

GROUP CHAT:   A ↔ B ↔ C ↔ D    ← We use this one      (rounds, shared history, debate)

HANDOFF:      A → (stuck?) → B → (complex?) → Human   (escalation with human-in-the-loop)

MAGENTIC:     LLM picks who speaks next dynamically   (AI-driven speaker selection)

We use GroupChatOrchestration with RoundRobinGroupChatManager because our problem requires agents to see each other's work, challenge assumptions, and build on each other's analysis across two rounds.

2. Architecture Overview
Here's the complete architecture of what we're building:
[Figure: architecture diagram of the multi-agent analyzer]

3. Understanding the Key SK Components
Before we write code, let's understand the 5 components we'll use and the design pattern each implements.

ChatCompletionAgent: Strategy Pattern
The agent definition.
Each agent is a combination of:
- name: unique identifier (used in round-robin ordering)
- instructions: the persona and rules (this is the prompt engineering)
- service: which AI provider to call (Strategy Pattern: swap providers without changing agent logic)
- description: what other agents/tools understand about this agent

agent = ChatCompletionAgent(
    name="ClientAnalyst",
    instructions="You are ONLY ClientAnalyst...",
    service=gemini_service,  # ← Strategy: swap to OpenAI with zero changes to the agent
    description="Analyzes client-side issues",
)

GroupChatOrchestration: Mediator Pattern
The orchestration defines HOW agents interact. It's the Mediator: agents don't talk to each other directly. Instead, the orchestration manages a shared ChatHistory and routes messages through the Manager.

RoundRobinGroupChatManager: Strategy Pattern
The Manager decides WHO speaks next. RoundRobinGroupChatManager cycles through agents in a fixed order. SK also lets you write your own manager by subclassing GroupChatManager, for example one where an LLM decides who speaks next. max_rounds caps the total number of agent messages (turns) in the whole conversation, not the number per agent: with 4 agents and max_rounds=8, each agent speaks exactly twice.

InProcessRuntime: Actor Model Abstraction
The execution engine. Every agent becomes an "actor" with its own mailbox (message queue). The runtime delivers messages between actors. Key properties:
- No shared state: agents communicate only through messages
- Sequential processing: each agent processes one message at a time
- Location transparency: the same code works in-process today, distributed tomorrow

agent_response_callback: Observer Pattern
A function that fires after EVERY agent response. We use it to display each agent's output in real time with emoji labels and round numbers.

4. The Actor Model: How InProcessRuntime Works
The Actor Model is a concurrency pattern where each entity is an isolated "actor" with a private mailbox.
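You can illustrate the mailbox idea without SK at all. The sketch below is a toy model in plain asyncio, not InProcessRuntime's actual implementation. Each hypothetical `AgentActor` owns a private `asyncio.Queue`, handles one message at a time, and replies by posting to the manager's queue rather than touching shared state. The hypothetical `run_round_robin` manager hands out one turn per "round" in fixed order until `max_rounds` agent messages exist, which is how the article describes RoundRobinGroupChatManager counting. The reply string stands in for an LLM call.

```python
import asyncio


class AgentActor:
    """Toy actor: owns a private mailbox and handles one message at a time."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.mailbox: asyncio.Queue = asyncio.Queue()

    async def run(self) -> None:
        while True:
            item = await self.mailbox.get()
            if item is None:  # shutdown signal from the manager
                return
            history, reply_to = item
            # Stand-in for an LLM call: report how much of the transcript was visible.
            reply = f"{self.name} saw {len(history)} message(s)"
            await reply_to.put((self.name, reply))


async def run_round_robin(names: list[str], task: str, max_rounds: int) -> list[tuple[str, str]]:
    """Manager loop: one agent turn per 'round', in fixed order, until max_rounds."""
    agents = [AgentActor(n) for n in names]
    workers = [asyncio.create_task(a.run()) for a in agents]
    inbox: asyncio.Queue = asyncio.Queue()  # the manager's own mailbox
    history: list[tuple[str, str]] = [("user", task)]  # only the manager mutates this

    for turn in range(max_rounds):
        agent = agents[turn % len(agents)]
        # Send a snapshot of the transcript, never the list object itself.
        await agent.mailbox.put((list(history), inbox))
        history.append(await inbox.get())  # wait for that agent's reply

    for agent in agents:
        await agent.mailbox.put(None)
    await asyncio.gather(*workers)
    return history[1:]  # agent messages only, without the task


names = ["ClientAnalyst", "NetworkAnalyst", "ServerAnalyst", "Coordinator"]
# In a notebook an event loop is already running: use `await run_round_robin(...)` instead.
transcript = asyncio.run(run_round_robin(names, "504 on uploads > 10MB", max_rounds=8))
for i, (speaker, text) in enumerate(transcript, 1):
    print(i, speaker, "|", text)
```

With four names and `max_rounds=8`, each actor answers exactly twice. The design choice that matters is sending `list(history)` instead of `history`: only the manager ever writes the transcript, so the actors never share mutable state.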
Here's what happens inside InProcessRuntime when we run our demo:

runtime.start()
│
├── Creates internal message loop (asyncio event loop)
│
orchestration.invoke(task="504 timeout...", runtime=runtime)
│
├── Creates Actor[Orchestrator]     → manages overall flow
├── Creates Actor[Manager]          → RoundRobinGroupChatManager
├── Creates Actor[ClientAnalyst]    → mailbox created, waiting
├── Creates Actor[NetworkAnalyst]   → mailbox created, waiting
├── Creates Actor[ServerAnalyst]    → mailbox created, waiting
└── Creates Actor[Coordinator]      → mailbox created, waiting

Manager receives "start" message
│
├── Checks turn order: [Client, Network, Server, Coordinator]
├── Sends task to ClientAnalyst mailbox
│     → ClientAnalyst processes: calls LLM → response
│     → Response added to shared ChatHistory
│     → callback fires (displayed in Notebook UI)
│     → Sends "done" back to Manager
│
├── Manager updates: turn_index=1
├── Sends to NetworkAnalyst mailbox
│     → Same flow...
│
├── ... (ServerAnalyst, Coordinator for Round 1)
│
├── Manager checks: messages=4, max_rounds=8 → continue
│
├── Round 2: same cycle with cross-examination
│
└── After message 8: Manager sends "complete"
      → OrchestrationResult resolves
      → result.get() returns final answer

runtime.stop_when_idle()
      → All mailboxes empty → clean shutdown

The Actor Model guarantees:
- No race conditions on agent state (each actor processes one message at a time)
- No lock-based deadlocks (there are no shared locks to contend for)
- No shared mutable state (agents communicate only via messages)

5. Setting Up Your Development Environment

Prerequisites
- Python 3.11 or 3.12 (3.13+ may have compatibility issues with some SK connectors)
- Visual Studio Code with the Python and Jupyter extensions
- An API key from one of: Google AI Studio (free), OpenAI, or an Azure AI Foundry deployment

Step 1: Install Python
Download from python.org. During installation on Windows, check "Add Python to PATH".
Verify:
python --version   # Python 3.12.x

Step 2: Install VS Code Extensions
Open VS Code, go to Extensions (Ctrl+Shift+X), and install:
- Python (by Microsoft): Python language support
- Jupyter (by Microsoft): Notebook support
- Pylance (by Microsoft): IntelliSense and type checking

Step 3: Create Project Folder
mkdir sk-multiagent-demo
cd sk-multiagent-demo

Open it in VS Code:
code .

Step 4: Create Virtual Environment
Open the VS Code terminal (Ctrl+`) and run:

# Create virtual environment
python -m venv sk-env

# Activate it
# Windows:
sk-env\Scripts\activate
# macOS/Linux:
source sk-env/bin/activate

You should see (sk-env) in your terminal prompt.

Step 5: Install Semantic Kernel
For Google Gemini (free tier, recommended for getting started):
pip install "semantic-kernel[google]" python-dotenv ipykernel
(The quotes stop shells such as zsh from treating the square brackets as a glob pattern.)

For OpenAI (paid API key):
pip install semantic-kernel openai python-dotenv ipykernel

For Azure AI Foundry (enterprise, Entra ID auth):
pip install semantic-kernel azure-identity python-dotenv ipykernel

Step 6: Register the Jupyter Kernel
python -m ipykernel install --user --name=sk-env --display-name="Semantic Kernel (Python 3.12)"

If this kernel is already available in your environment, you can also select it directly in VS Code.

Step 7: Get Your API Key
Option A: Google Gemini (FREE, recommended for the demo)
1. Go to https://aistudio.google.com/apikey
2. Click "Create API Key"
3. Copy the key
Free tier limits at the time of writing were 15 requests/minute and 1 million tokens/minute, which is more than enough for this demo. Limits vary by model and change over time, so check your quota in AI Studio.
Option B: OpenAI
1. Go to https://platform.openai.com/api-keys
2. Create a new key
3. Copy the key

Option C: Azure AI Foundry
1. Deploy a model in the Azure AI Foundry portal
2. Note the endpoint URL and deployment name
3. If key-based auth is disabled, you'll need Entra ID with the appropriate permissions

Step 8: Create the .env File
In your project root, create a file named .env:

For Gemini:
GOOGLE_AI_API_KEY=AIzaSy...your-key-here
GOOGLE_AI_GEMINI_MODEL_ID=gemini-2.5-flash

For OpenAI:
OPENAI_API_KEY=sk-...your-key-here
OPENAI_CHAT_MODEL_ID=gpt-4o

For Azure AI Foundry:
AZURE_OPENAI_ENDPOINT=https://your-resource.cognitiveservices.azure.com
AZURE_OPENAI_CHAT_DEPLOYMENT_NAME=gpt-4o
AZURE_OPENAI_API_KEY=your-key

Step 9: Create the Notebook
In VS Code:
1. Click File > New File
2. Save as multi_agent_analyzer.ipynb
3. In the top-right of the notebook, click Select Kernel
4. Choose Semantic Kernel (Python 3.12) (or your sk-env)

Your environment is ready. Let's build.

6. Step-by-Step: Building the Multi-Agent Analyzer

Cell 1: Verify Setup

import semantic_kernel
print(f"Semantic Kernel version: {semantic_kernel.__version__}")

from semantic_kernel.agents import (
    ChatCompletionAgent,
    GroupChatOrchestration,
    RoundRobinGroupChatManager,
)
from semantic_kernel.agents.runtime import InProcessRuntime
from semantic_kernel.contents import ChatMessageContent

print("All imports successful")

Cell 2: Load API Key and Create Service

For Gemini:

import os
from dotenv import load_dotenv

load_dotenv()

from semantic_kernel.connectors.ai.google.google_ai import (
    GoogleAIChatCompletion,
    GoogleAIChatPromptExecutionSettings,
)
from semantic_kernel.contents import ChatHistory

GEMINI_API_KEY = os.getenv("GOOGLE_AI_API_KEY")
GEMINI_MODEL = os.getenv("GOOGLE_AI_GEMINI_MODEL_ID", "gemini-2.5-flash")

service = GoogleAIChatCompletion(
    gemini_model_id=GEMINI_MODEL,
    api_key=GEMINI_API_KEY,
)
print(f"Service created: Gemini {GEMINI_MODEL}")

# Smoke test (top-level await works inside a Jupyter notebook cell)
settings = GoogleAIChatPromptExecutionSettings()
test_history = ChatHistory(system_message="You are a helpful assistant.")
test_history.add_user_message("Say 'Connected!' and nothing else.")
response = await service.get_chat_message_content(
    chat_history=test_history, settings=settings
)
print(f"Model says: {response.content}")

For OpenAI:

import os
from dotenv import load_dotenv

load_dotenv()

from semantic_kernel.connectors.ai.open_ai import (
    OpenAIChatCompletion,
    OpenAIChatPromptExecutionSettings,
)
from semantic_kernel.contents import ChatHistory

# The API key is read from OPENAI_API_KEY in the environment (loaded from .env)
service = OpenAIChatCompletion(
    ai_model_id=os.getenv("OPENAI_CHAT_MODEL_ID", "gpt-4o"),
)
print(f"Service created: OpenAI {os.getenv('OPENAI_CHAT_MODEL_ID', 'gpt-4o')}")

# Smoke test
settings = OpenAIChatPromptExecutionSettings()
test_history = ChatHistory(system_message="You are a helpful assistant.")
test_history.add_user_message("Say 'Connected!' and nothing else.")
response = await service.get_chat_message_content(
    chat_history=test_history, settings=settings
)
print(f"Model says: {response.content}")

Cell 3: Define All 4 Agents
This is the most important cell. It contains the prompt engineering that makes the demo work:

from semantic_kernel.agents import ChatCompletionAgent

# ═══════════════════════════════════════════════════
# AGENT 1: Client-Side Analyst
# ═══════════════════════════════════════════════════
client_agent = ChatCompletionAgent(
    name="ClientAnalyst",
    description="Analyzes problems from the client-side: browser, JS, CORS, caching, UI symptoms",
    instructions="""You are ONLY **ClientAnalyst**. You must NEVER speak as NetworkAnalyst, ServerAnalyst, or Coordinator. Every word you write is from ClientAnalyst's perspective only.

You are a senior front-end and client-side diagnostics expert. When given a problem statement, analyze it EXCLUSIVELY from the client side:

1. **Browser & Rendering**: DOM issues, JavaScript errors, CSS rendering, browser compatibility, memory leaks, console errors.
2. **Client-Side Caching**: Stale cache, service worker issues, local storage corruption.
3. **Network from Client View**: CORS errors, preflight failures, request timeouts, client-side retry storms, fetch/XHR configuration.
4. **Upload Handling**: File API usage, chunk upload implementation, progress tracking, FormData construction, content-type headers.
5. **UI/UX Symptoms**: What the user sees, error messages displayed, loading states.

ROUND 1: Provide your independent analysis. Do NOT reference other agents. List your top 3 most likely causes with evidence. Every response MUST be at least 200 words.

ROUND 2: You MUST:
- Reference NetworkAnalyst and ServerAnalyst BY NAME
- State specifically where you AGREE or DISAGREE with their findings
- Answer the Coordinator's questions from your perspective
- Add NEW cross-layer insights you see from the client perspective
- Do NOT just say 'I agree' — provide substantive technical reasoning

Be specific, evidence-based, and prioritize findings by likelihood.""",
    service=service,
)

# ═══════════════════════════════════════════════════
# AGENT 2: Network Analyst
# ═══════════════════════════════════════════════════
network_agent = ChatCompletionAgent(
    name="NetworkAnalyst",
    description="Analyzes problems from the network side: DNS, TCP, TLS, firewalls, load balancers, latency",
    instructions="""You are ONLY **NetworkAnalyst**. You must NEVER speak as ClientAnalyst, ServerAnalyst, or Coordinator. Every word you write is from NetworkAnalyst's perspective only.

You are a senior network infrastructure diagnostics expert. When given a problem statement, analyze it EXCLUSIVELY from the network layer:

1. **DNS & Resolution**: DNS TTL, propagation delays, record misconfigurations.
2. **TCP/IP & Connections**: Connection pooling, keep-alive, TCP window scaling, connection resets, SYN floods.
3. **TLS/SSL**: Certificate issues, handshake failures, protocol version mismatches.
4. **Load Balancers & Proxies**: Sticky sessions, health checks, timeout configs, request body size limits, proxy buffering.
5. **Firewall & WAF**: Rule blocks, rate limiting, request inspection delays, geo-blocking, DDoS protection interference.

ROUND 1: Provide your independent analysis. Do NOT reference other agents. List your top 3 most likely causes with evidence. Every response MUST be at least 200 words.

ROUND 2: You MUST:
- Reference ClientAnalyst and ServerAnalyst BY NAME
- State specifically where you AGREE or DISAGREE with their findings
- Answer the Coordinator's questions from your perspective
- Add NEW cross-layer insights you see from the network perspective
- Do NOT just say 'I am ready to proceed' — provide substantive technical analysis

Be specific, evidence-based, and prioritize findings by likelihood.""",
    service=service,
)

# ═══════════════════════════════════════════════════
# AGENT 3: Server-Side Analyst
# ═══════════════════════════════════════════════════
server_agent = ChatCompletionAgent(
    name="ServerAnalyst",
    description="Analyzes problems from the server side: backend app, database, logs, resources, deployments",
    instructions="""You are ONLY **ServerAnalyst**. You must NEVER speak as ClientAnalyst, NetworkAnalyst, or Coordinator. Every word you write is from ServerAnalyst's perspective only.

You are a senior backend and infrastructure diagnostics expert. When given a problem statement, analyze it EXCLUSIVELY from the server side:

1. **Application Server**: Error logs, exception traces, thread pool exhaustion, memory leaks, CPU spikes, garbage collection pauses.
2. **Database**: Slow queries, connection pool saturation, lock contention, deadlocks, replication lag, query plan changes.
3. **Deployment & Config**: Recent deployments, configuration changes, feature flags, environment variable mismatches, rollback candidates.
4. **Resource Limits**: File upload size limits, request body limits, disk space, temporary file cleanup, storage quotas.
5. **External Dependencies**: Upstream API timeouts, third-party service degradation, queue backlogs, cache (Redis/Memcached) issues.

ROUND 1: Provide your independent analysis. Do NOT reference other agents. List your top 3 most likely causes with evidence. Every response MUST be at least 200 words.

ROUND 2: You MUST:
- Reference ClientAnalyst and NetworkAnalyst BY NAME
- State specifically where you AGREE or DISAGREE with their findings
- Answer the Coordinator's questions from your perspective
- Add NEW cross-layer insights you see from the server perspective
- Do NOT just say 'I agree' — provide substantive technical reasoning

Be specific, evidence-based, and prioritize findings by likelihood.""",
    service=service,
)

# ═══════════════════════════════════════════════════
# AGENT 4: Coordinator
# ═══════════════════════════════════════════════════
coordinator_agent = ChatCompletionAgent(
    name="Coordinator",
    description="Synthesizes all specialist analyses into a final root cause report with prioritized action plan",
    instructions="""You are ONLY **Coordinator**. You must NEVER speak as ClientAnalyst, NetworkAnalyst, or ServerAnalyst. You synthesize — you do NOT do domain-specific analysis.

You are the lead engineer who synthesizes the team's findings.

═══ ROUND 1 BEHAVIOR (your first turn, message 4) ═══
Keep this SHORT — maximum 300 words.
- Note 2-3 KEY PATTERNS across the three analyses
- Identify where specialists AGREE (high-confidence)
- Identify where they CONTRADICT (needs resolution)
- Ask 2-3 SPECIFIC QUESTIONS for Round 2

Round 1 MUST NOT: assign tasks, create action plans, write reports, or tell agents what to take lead on. Observation + questions ONLY.

═══ ROUND 2 BEHAVIOR (your final turn, message 8) ═══
Keep this FOCUSED — maximum 800 words. Produce a structured report:

1. **Root Cause** (1 paragraph): The #1 most likely cause with causal chain across layers. Reference specific findings from each specialist.
2. **Confidence** (short list):
   - HIGH: Areas where all 3 agreed
   - MEDIUM: Areas where 2 of 3 agreed
   - LOW: Disagreements needing investigation
3. **Action Plan** (numbered, max 6 items): For each:
   - What to do (specific)
   - Owner (Client/Network/Server team)
   - Time estimate
4. **Quick Wins vs Long-term** (2 short lists)

Do NOT repeat what specialists already said verbatim. Synthesize, don't echo.""",
    service=service,
)

# ═══════════════════════════════════════════════════
# All 4 agents — order = RoundRobin order
# ═══════════════════════════════════════════════════
agents = [client_agent, network_agent, server_agent, coordinator_agent]

print(f"{len(agents)} agents created:")
for i, a in enumerate(agents, 1):
    print(f"  {i}. {a.name}: {a.description[:60]}...")
print(f"\nRoundRobin order: {' → '.join(a.name for a in agents)}")

Cell 4: Run the Analysis

from semantic_kernel.agents import GroupChatOrchestration, RoundRobinGroupChatManager
from semantic_kernel.agents.runtime import InProcessRuntime
from semantic_kernel.contents import ChatMessageContent
from IPython.display import display, Markdown

# ╔══════════════════════════════════════════════════════════╗
# ║ EDIT YOUR PROBLEM STATEMENT HERE                         ║
# ╚══════════════════════════════════════════════════════════╝
PROBLEM = """
Users are reporting intermittent 504 Gateway Timeout errors when trying to upload
files larger than 10MB through our web application. The issue started after last
Friday's deployment and seems worse during peak hours (2-5 PM EST). Some users also
report that smaller file uploads work fine but the progress bar freezes at 85% for
large files before timing out.
"""
# ════════════════════════════════════════════════════════════

agent_responses = []

def agent_response_callback(message: ChatMessageContent) -> None:
    """Observer: runs after every agent turn to record and render the response."""
    name = message.name or "Unknown"
    content = message.content or ""
    agent_responses.append({"agent": name, "content": content})
    emoji = {
        "ClientAnalyst": "🖥️", "NetworkAnalyst": "🌐",
        "ServerAnalyst": "⚙️", "Coordinator": "🎯",
    }.get(name, "🔹")
    round_num = (len(agent_responses) - 1) // len(agents) + 1
    display(Markdown(
        f"---\n### {emoji} {name} (Message {len(agent_responses)}, Round {round_num})\n\n{content}"
    ))

MAX_ROUNDS = 8  # 4 agents × 2 rounds = 8 messages exactly

task = f"""## Problem Statement
{PROBLEM.strip()}

## Discussion Rules
You are in a GROUP DISCUSSION with 4 members. You can see ALL previous messages. There are exactly 2 rounds.

### ROUND 1 (Messages 1-4): Independent Analysis
- ClientAnalyst, NetworkAnalyst, ServerAnalyst: Analyze from YOUR domain only. Give your top 3 most likely causes with evidence and reasoning.
- Coordinator: Note patterns across the 3 analyses. Ask 2-3 specific questions. Do NOT assign tasks yet.

### ROUND 2 (Messages 5-8): Cross-Examination & Final Report
- ClientAnalyst, NetworkAnalyst, ServerAnalyst: You MUST reference the OTHER specialists BY NAME. State where you agree, disagree, or have new insights. Answer the Coordinator's questions. Provide SUBSTANTIVE analysis.
- Coordinator: Produce the FINAL structured report: root cause, confidence levels, prioritized action plan with owners and time estimates.

IMPORTANT: Each agent speaks as THEMSELVES only. Never impersonate another agent."""

display(Markdown(f"## Problem Statement\n\n{PROBLEM.strip()}"))
display(Markdown(
    f"---\n## Discussion Starting — {len(agents)} agents, "
    f"{MAX_ROUNDS // len(agents)} rounds ({MAX_ROUNDS} messages)\n"
))

# Build and run
orchestration = GroupChatOrchestration(
    members=agents,
    manager=RoundRobinGroupChatManager(max_rounds=MAX_ROUNDS),
    agent_response_callback=agent_response_callback,
)

runtime = InProcessRuntime()
runtime.start()

result = await orchestration.invoke(task=task, runtime=runtime)
final_result = await result.get(timeout=300)  # wait up to 5 minutes for all turns

await runtime.stop_when_idle()

display(Markdown(f"---\n## FINAL CONCLUSION\n\n{final_result}"))

Cell 5: Statistics and Validation

import re

print("═" * 55)
print(" ANALYSIS STATISTICS")
print("═" * 55)

emojis = {"ClientAnalyst": "🖥️", "NetworkAnalyst": "🌐", "ServerAnalyst": "⚙️", "Coordinator": "🎯"}

# Count messages and characters per agent
agent_counts = {}
agent_chars = {}
for r in agent_responses:
    agent_counts[r["agent"]] = agent_counts.get(r["agent"], 0) + 1
    agent_chars[r["agent"]] = agent_chars.get(r["agent"], 0) + len(r["content"])

for agent, count in agent_counts.items():
    em = emojis.get(agent, "🔹")
    chars = agent_chars.get(agent, 0)
    avg = chars // count if count else 0
    print(f" {em} {agent}: {count} msg(s), ~{chars:,} chars (avg {avg:,}/msg)")

print(f"\n Total messages: {len(agent_responses)}")
total_chars = sum(len(r["content"]) for r in agent_responses)
print(f" Total analysis: ~{total_chars:,} characters")

# Validation
print("\n Validation:")

# Identity check: flag "As <OtherAgent>, I ..." near the start of a response
identity_issues = []
for r in agent_responses:
    other_agents = [a.name for a in agents if a.name != r["agent"]]
    for other in other_agents:
        pattern = rf"(?i)as {re.escape(other)}[,:]?\s+I\b"
        if re.search(pattern, r["content"][:300]):
            identity_issues.append(f"{r['agent']} impersonated {other}")

if identity_issues:
    print(f" Identity confusion: {identity_issues}")
else:
    print(" No identity confusion detected")

# Thin-response check: anything under 100 characters is suspicious
thin = [r for r in agent_responses if len(r["content"].strip()) < 100]
if thin:
    for t in thin:
        print(f" Thin response from {t['agent']}")
else:
    print(" All responses are substantive")

Cell 6: Save Report

from datetime import datetime

timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
filename = f"analysis_report_{timestamp}.md"

with open(filename, "w", encoding="utf-8") as f:
    f.write("# Problem Analysis Report\n\n")
    f.write(f"**Generated:** {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n")
    f.write(f"**Agents:** {', '.join(a.name for a in agents)}\n")
    f.write(f"**Messages (MAX_ROUNDS):** {MAX_ROUNDS}\n\n---\n\n")
    f.write(f"## Problem Statement\n\n{PROBLEM.strip()}\n\n---\n\n")
    for i, r in enumerate(agent_responses, 1):
        em = emojis.get(r["agent"], "🔹")
        round_num = (i - 1) // len(agents) + 1
        f.write(f"### {em} {r['agent']} (Message {i}, Round {round_num})\n\n")
        f.write(f"{r['content']}\n\n---\n\n")
    f.write(f"## Final Conclusion\n\n{final_result}\n")

print(f"Report saved to: {filename}")

7. The Agent Interaction Flow: Round by Round
Here's what actually happens during the 8-message orchestration:

Round 1: Independent Analysis (Messages 1-4)
1. ClientAnalyst
   Sees: the problem statement only.
   Does: analyzes from the client perspective: upload chunking, progress bar freezing at 85%, CORS, content-type headers.
2. NetworkAnalyst
   Sees: the problem plus ClientAnalyst's analysis.
   Does: gives an INDEPENDENT analysis despite seeing message 1: load balancer timeouts, proxy body size limits, TCP window scaling.
3. ServerAnalyst
   Sees: the problem plus messages 1-2.
   Does: gives an INDEPENDENT analysis: recent deployment regression, request body parser, thread pool exhaustion, disk space.
4. Coordinator
   Sees: the problem plus messages 1-3.
   Does: observes patterns: "All three mention timeout configuration. ClientAnalyst and NetworkAnalyst both point to body size. Question: Was the deployment a backend-only change or did it include infra?"
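The "Sees" column follows from two facts: turns are assigned round-robin, and every member shares one ChatHistory. As a result, the speaker of message k sees the task plus messages 1 to k-1. The pure-Python sketch below reproduces that bookkeeping and uses the same `(k - 1) // len(agents) + 1` round formula as the notebook's callback. `round_robin_schedule` is a hypothetical helper for illustration only, not part of SK. Like the article, it assumes `max_rounds` counts individual agent messages.

```python
def round_robin_schedule(agent_names: list[str], max_rounds: int) -> list[dict]:
    """List who speaks at each message, in which round, and what they can see.

    Assumes max_rounds counts individual agent messages: message k goes to
    agent (k - 1) % N and belongs to discussion round (k - 1) // N + 1.
    With a shared history, the speaker at message k sees the task plus
    messages 1..k-1.
    """
    n = len(agent_names)
    schedule = []
    for k in range(1, max_rounds + 1):
        schedule.append({
            "message": k,
            "agent": agent_names[(k - 1) % n],
            "round": (k - 1) // n + 1,
            "sees": ["task"] + [f"msg {j}" for j in range(1, k)],
        })
    return schedule


names = ["ClientAnalyst", "NetworkAnalyst", "ServerAnalyst", "Coordinator"]
for row in round_robin_schedule(names, max_rounds=8):
    print(row["message"], f"round {row['round']}", row["agent"], "sees:", ", ".join(row["sees"]))

# With one message too many, the 9th turn wraps around to ClientAnalyst in a 3rd round.
ninth = round_robin_schedule(names, max_rounds=9)[-1]
print(ninth["agent"], ninth["round"])  # → ClientAnalyst 3
```

The last two lines reproduce the MAX_ROUNDS=9 problem covered in the bugs section. Any value that is not a multiple of the agent count leaves a partial extra round.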
Round 2: Cross-Examination (Messages 5-8)
5. ClientAnalyst: "I agree with NetworkAnalyst that the load balancer timeout is likely a factor: the 85% freeze point matches the 30-second LB timeout for a 10MB upload on our average upload speed. However, I disagree with ServerAnalyst about thread pool exhaustion because the UI shows a clean 504, not a connection reset."
6. NetworkAnalyst: "ServerAnalyst's point about the recent deployment is critical. If the new request parser is 3x slower, that would push uploads past the LB timeout. I can confirm the LB has a 30s idle timeout. The fix is both: increase the LB timeout AND optimize the parser."
7. ServerAnalyst: "Responding to Coordinator's question: the deployment was backend-only, a new multipart parser using streaming instead of buffered reads. ClientAnalyst is correct that the 504 is from the LB, not the app. The app itself returns 200 after 45 seconds, but the LB kills the connection at 30."
8. Coordinator: produces the final structured report with the root cause: "The backend deployment introduced a slower multipart parser (45s vs 15s for 10MB). The load balancer's 30s timeout kills the connection at ~85% progress. Fix: immediate, increase the LB timeout to 120s; short-term, optimize the parser; long-term, implement chunked uploads with progress resumption."

Notice: The Round 2 analysis is dramatically better than Round 1. Agents reference each other by name, build on each other's findings, and the Coordinator can synthesize a cross-layer causal chain that no single agent could have produced.

I also tested a slightly adjusted version of the problem, targeting Azure Web Apps, using Google Gemini.

8. Bugs I Found & Fixed: Lessons Learned
Building this demo taught me several important lessons about multi-agent systems:

Bug 1: Agents Speaking Only Once
Symptom: Only 4 messages instead of 8.
Root cause: The agents list was missing the Coordinator.
It was defined in a separate cell and wasn't included in the members list. Fix: All 4 agents must be in the same list passed to GroupChatOrchestration. Bug 2: NetworkAnalyst Says "I'm Ready to Proceed" Symptom: NetworkAnalyst's Round 2 response was just "I'm ready to proceed with the analysis" — no actual content. Root cause: The Coordinator's Round 1 message was assigning tasks ("NetworkAnalyst, please check the load balancer config"), and the agent was acknowledging the assignment instead of analyzing. Fix: Added explicit constraint to Coordinator: "Round 1 MUST NOT assign tasks — observation + questions ONLY." Bug 3: ServerAnalyst Says "As NetworkAnalyst, I..." Symptom: ServerAnalyst's response started with "As NetworkAnalyst, I believe..." Root cause: LLM identity bleeding. When agents share ChatHistory, the LLM sometimes loses track of which agent it's currently playing. This is especially common with Gemini. Fix: Identity anchoring at the very top of every agent's instructions: "You are ONLY ServerAnalyst. You must NEVER speak as ClientAnalyst, NetworkAnalyst, or Coordinator." Bug 4: Gemini Gives Thin/Empty Responses Symptom: Some agents responded with just one sentence or "I concur." Root cause: Gemini 2.5 Flash is more concise than GPT-4o by default. Without explicit length requirements, it takes shortcuts. Fix: Added "Every response MUST be at least 200 words" and "Answer the Coordinator's questions" to every specialist's instructions. Bug 5: Coordinator's Report is 18K Characters Symptom: The Coordinator's Round 2 response was absurdly long — repeating everything every specialist said. Fix: Added word limits: "Round 1 max 300 words, Round 2 max 800 words" and "Synthesize, don't echo." Bug 6: MAX_ROUNDS Math Symptom: With MAX_ROUNDS=9, ClientAnalyst spoke a 3rd time after the Coordinator's final report — breaking the clean 2-round structure. Fix: MAX_ROUNDS must equal (number of agents × number of rounds). For 4 agents × 2 rounds = 8. 9. 
Running with Different AI Providers The beauty of SK's Strategy Pattern is that you change ONE LINE to switch providers. Everything else — agents, orchestration, callbacks, validation — stays identical. Gemini setup: from semantic_kernel.connectors.ai.google.google_ai import GoogleAIChatCompletion service = GoogleAIChatCompletion( gemini_model_id="gemini-2.5-flash", api_key=os.getenv("GOOGLE_AI_API_KEY"), ) OpenAI Setup from semantic_kernel.connectors.ai.open_ai import OpenAIChatCompletion service = OpenAIChatCompletion( ai_model_id="gpt-4o", api_key=os.getenv("OPEN_AI_API_KEY"), ) 10. What to Build Next Add Plugins to Agents Give agents real tools — not just LLM reasoning - looks exciting right ;) class NetworkDiagnosticPlugin: (description="Pings a host and returns latency") def ping(self, host: str) -> str: result = subprocess.run(["ping", "-c", "3", host], capture_output=True, text=True) return result.stdout class LogSearchPlugin: (description="Searches server logs for error patterns") def search_logs(self, pattern: str, hours: int = 1) -> str: # Query your log aggregator (Splunk, ELK, Azure Monitor) return query_logs(pattern, hours) Add Filters for Governance Intercept every agent call for PII redaction and audit logging: .filter(filter_type=FilterTypes.FUNCTION_INVOCATION) async def audit_filter(context, next): print(f"[AUDIT] {context.function.name} called by agent") await next(context) print(f"[AUDIT] {context.function.name} returned") Try Different Orchestration Patterns Replace GroupChat with Sequential for a pipeline approach: # Instead of debate, each agent builds on the previous orchestration = SequentialOrchestration( members=[client_agent, network_agent, server_agent, coordinator_agent] ) Or Concurrent for parallel analysis: # All specialists analyze simultaneously, Coordinator aggregates orchestration = ConcurrentOrchestration( members=[client_agent, network_agent, server_agent] ) Deploy to Azure Move from InProcessRuntime to Azure Container Apps 
for production scaling. The agent code doesn't change — only the runtime. Summary The key insight from building this demo: multi-agent systems produce better results than single agents not because each agent is smarter, but because the debate structure forces cross-domain thinking that a single prompt can never achieve. The Coordinator's final report consistently identifies causal chains that span client, network, and server layers — exactly the kind of insight that production incident response teams need. Semantic Kernel makes this possible with clean separation of concerns: agents define WHAT to analyze, orchestration defines HOW they interact, the manager defines WHO speaks when, the runtime handles WHERE it executes, and callbacks let you OBSERVE everything. Each piece is independently swappable — that's the power of SK from Microsoft. Resources: GitHub: github.com/microsoft/semantic-kernel Docs: learn.microsoft.com/semantic-kernel Orchestration Patterns: learn.microsoft.com/semantic-kernel/frameworks/agent/agent-orchestration Discord: aka.ms/sk/discord Disclaimer: The sample scripts provided in this article are provided AS IS without warranty of any kind. The author is not responsible for any issues, damages, or problems that may arise from using these scripts. Users should thoroughly test any implementation in their environment before deploying to production. Azure services and APIs may change over time, which could affect the functionality of the provided scripts. Always refer to the latest Azure documentation for the most up-to-date information. Thanks for reading this blog! I hope you found it helpful and informative for building AI agents with SK (Semantic Kernel) 😀383Views3likes0CommentsSecurity Copilot Integration with Microsoft Sentinel - Why Automation matters now
Security Operations Centers face a relentless challenge: the volume of security alerts far exceeds the capacity of human analysts. On average, a mid-sized SOC receives thousands of alerts per day, and analysts spend up to 80% of their time on initial triage. Triage means determining whether an alert is a true positive, understanding its scope, and deciding on next steps. With Microsoft Security Copilot now deeply integrated into Microsoft Sentinel, there is finally a practical path to automating the most time-consuming parts of this workflow.

So let me walk you through how to combine Security Copilot with Sentinel to build an automated incident triage pipeline. The walkthrough includes KQL queries, automation rule patterns, and practical scenarios drawn from common enterprise deployments.

Traditional triage workflows rely on analysts manually reviewing each incident: reading alert details, correlating entities across data sources, checking threat intelligence, and making a severity assessment. This is slow, inconsistent, and does not scale. Security Copilot changes this equation by providing:
- Natural language incident summarization: turning complex, multi-alert incidents into analyst-readable narratives
- Automated entity enrichment: pulling threat intelligence, user risk scores, and device compliance state without manual lookups
- Guided response recommendations: suggesting containment and remediation steps based on the incident type and organizational context

The key insight is that Copilot does not replace analysts. It handles the repetitive first-pass triage so analysts can focus on decision-making and complex investigations.
Architecture - How the Pieces Fit Together
The automated triage pipeline consists of four layers:
1. Detection Layer - Sentinel analytics rules generate incidents from log data
2. Enrichment Layer - Automation rules trigger Logic Apps that call Security Copilot
3. Triage Layer - Copilot analyzes the incident, enriches entities, and produces a triage summary
4. Routing Layer - Based on Copilot's assessment, incidents are routed, re-prioritized, or auto-closed

(Forgive my AI-painted illustration here, but I find it a nice way to display dependencies.)

+-----------------------------------------------------------+
|                    Microsoft Sentinel                     |
|                                                           |
|  Analytics Rules --> Incidents --> Automation Rules       |
|                                        |                  |
|                                        v                  |
|                              Logic App / Playbook         |
|                                        |                  |
|                                        v                  |
|                              Security Copilot API         |
|                               +-----------------+         |
|                               | Summarize       |         |
|                               | Enrich Entities |         |
|                               | Assess Risk     |         |
|                               | Recommend Action|         |
|                               +--------+--------+         |
|                                        |                  |
|                                        v                  |
|                         +-----------------------------+   |
|                         | Update Incident             |   |
|                         |  - Add triage summary tag   |   |
|                         |  - Adjust severity          |   |
|                         |  - Assign to analyst/team   |   |
|                         |  - Auto-close false positive|   |
|                         +-----------------------------+   |
+-----------------------------------------------------------+

Step 1 - Identify High-Volume Triage Candidates
Not every incident type benefits equally from automated triage. Start with alert types that are high in volume but often turn out to be false positives or low severity. Use this KQL query to identify your top candidates. SecurityIncident stores one row per incident update, so the query first keeps only the latest row for each incident:

SecurityIncident
| where TimeGenerated > ago(30d)
| summarize arg_max(TimeGenerated, *) by IncidentNumber   // latest state of each incident
| summarize
    TotalIncidents = count(),
    AutoClosed = countif(Classification == "FalsePositive" or Classification == "BenignPositive"),
    AvgTimeToTriageMinutes = avg(datetime_diff('minute', FirstModifiedTime, CreatedTime))
    by Title
| extend FalsePositiveRate = round(AutoClosed * 100.0 / TotalIncidents, 1)
| where TotalIncidents > 10
| order by TotalIncidents desc
| take 20

This query surfaces the incident types where automation will deliver the highest ROI. AvgTimeToTriageMinutes uses the time from incident creation to its first modification as a proxy for triage time.
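The Step 1 ranking logic is easy to sanity-check outside Log Analytics. Below is a pure-Python sketch over a hypothetical in-memory export of incidents. It assumes one record per incident, because the raw SecurityIncident table can hold several rows per incident. The field names mirror the query's columns. `triage_candidates` and the sample records are made up for illustration.

```python
from collections import defaultdict


def triage_candidates(incidents, min_count=10, top=20):
    """Group incidents by title and rank by volume, mirroring the Step 1 query.

    `incidents` is a list of dicts with 'Title' and 'Classification' keys,
    one dict per incident (hypothetical export format).
    """
    totals = defaultdict(int)
    benign_or_false = defaultdict(int)
    for inc in incidents:
        totals[inc["Title"]] += 1
        if inc["Classification"] in ("FalsePositive", "BenignPositive"):
            benign_or_false[inc["Title"]] += 1

    rows = []
    for title, total in totals.items():
        if total > min_count:                                  # where TotalIncidents > 10
            rate = round(benign_or_false[title] * 100.0 / total, 1)
            rows.append((title, total, rate))
    rows.sort(key=lambda row: row[1], reverse=True)            # order by TotalIncidents desc
    return rows[:top]                                          # take 20


# Made-up sample: one title clears the volume threshold, the other does not
sample = (
    [{"Title": "Impossible travel", "Classification": "BenignPositive"}] * 13
    + [{"Title": "Impossible travel", "Classification": "TruePositive"}] * 7
    + [{"Title": "Mass download", "Classification": "FalsePositive"}] * 5
)
print(triage_candidates(sample))
```

The sketch keeps the query's strict "more than 10" cutoff. This is why "Mass download", with only 5 incidents, drops out even though every one of them was a false positive.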
Based on publicly available data and community reports, the following categories consistently appear at the top:
- Impossible travel alerts (high volume, around 60% false positive rate)
- Suspicious sign-in activity from unfamiliar locations
- Mass file download and share events
- Mailbox forwarding rule creation

Step 2 - Build the Copilot-Powered Triage Playbook
Create a Logic App playbook that triggers on incident creation and uses the Security Copilot connector. The core flow looks like this:

Trigger: Microsoft Sentinel Incident - When an incident is created

Action 1 - Get incident entities:

let incidentEntities = SecurityIncident
| where IncidentNumber == <IncidentNumber>
| summarize arg_max(TimeGenerated, *) by IncidentNumber   // latest row for this incident
| mv-expand AlertIds to typeof(string)                     // one row per alert ID
| join kind=inner (SecurityAlert | extend AlertId = SystemAlertId) on $left.AlertIds == $right.AlertId
| mv-expand EntityData = parse_json(Entities)              // Entities is stored as a JSON string
| project
    EntityType = tostring(EntityData.Type),
    EntityValue = coalesce(
        tostring(EntityData.HostName),
        tostring(EntityData.Address),
        tostring(EntityData.Name),
        tostring(EntityData.DnsDomain))
| distinct EntityType, EntityValue;
incidentEntities

Note: The <IncidentNumber> placeholder above is a Logic App dynamic content variable. When building your playbook, select the incident number from the trigger output rather than hardcoding a value.

Action 2 - Copilot prompt session: Send a structured prompt to Security Copilot that requests:

Analyze this Microsoft Sentinel incident and provide a triage assessment:
Incident Title: {IncidentTitle}
Severity: {Severity}
Description: {Description}
Entities involved: {EntityList}
Alert count: {AlertCount}

Please provide:
1. A concise summary of what happened (2-3 sentences)
2. Entity risk assessment for each IP, user, and host
3. Whether this appears to be a true positive, benign positive, or false positive, stated on its own line as "Assessment: True Positive", "Assessment: Benign Positive", or "Assessment: False Positive"
4. Recommended next steps
5. Suggested severity adjustment (if any)

Action 3 - Parse and route: Use the Copilot response to update the incident.
The Logic App parses the structured output and:
- Adds the triage summary as an incident comment
- Tags the incident with copilot-triaged
- Adjusts severity if Copilot recommends it
- Routes to the appropriate analyst tier based on the assessment

Step 3 - Enrich with Contextual KQL Lookups
Security Copilot's assessment improves dramatically when you feed it contextual data. Before sending the prompt, enrich the incident with organization-specific signals:

// Check if the user has a history of similar alerts (repeat offender vs. first time)
let userAlertHistory = SecurityAlert
| where TimeGenerated > ago(90d)
| mv-expand EntityData = parse_json(Entities)              // Entities is stored as a JSON string
| where tostring(EntityData.Type) == "account"
// Account entities split the UPN into Name and UPNSuffix
| where strcat(tostring(EntityData.Name), "@", tostring(EntityData.UPNSuffix)) =~ "<UserPrincipalName>"
| summarize
    PriorAlertCount = dcount(SystemAlertId),
    DistinctAlertTypes = dcount(AlertName),
    LastAlertTime = max(TimeGenerated)
| extend IsRepeatOffender = PriorAlertCount > 5;
userAlertHistory

// Check user risk level from Entra ID Protection
AADUserRiskEvents
| where TimeGenerated > ago(7d)
| where UserPrincipalName =~ "<UserPrincipalName>"
| summarize arg_max(TimeGenerated, RiskLevel), RecentRiskEvents = count()
| project RiskLevel, RecentRiskEvents

Including this context in the Copilot prompt turns generic assessments into organization-aware triage decisions. A "suspicious sign-in" for a user who travels internationally every week is very different from the same alert for a user who has never left their home country.

Step 4 - Implement Feedback Loops
Automated triage is only as good as its accuracy over time.
Build a feedback mechanism by tracking Copilot's assessments against analysts' final classifications:

SecurityIncident
| where TimeGenerated > ago(30d)
| summarize arg_max(TimeGenerated, *) by IncidentNumber   // latest state of each incident
| where Labels has "copilot-triaged"                        // incident tags are stored in the Labels column
| where isnotempty(Classification)
| mv-expand Comments
| extend CopilotAssessment = extract("Assessment: (True Positive|False Positive|Benign Positive)", 1, tostring(Comments))
| where isnotempty(CopilotAssessment)
| summarize
    Total = dcount(IncidentNumber),
    Correct = dcountif(IncidentNumber,
        (CopilotAssessment == "False Positive" and Classification == "FalsePositive")
        or (CopilotAssessment == "True Positive" and Classification == "TruePositive")
        or (CopilotAssessment == "Benign Positive" and Classification == "BenignPositive"))
    by bin(TimeGenerated, 7d)
| extend AccuracyPercent = round(Correct * 100.0 / Total, 1)
| order by TimeGenerated asc

For this query to work reliably, the automation playbook must write the assessment in a consistent format within the incident comments. Use a structured prefix such as Assessment: True Positive so the regex extraction remains stable. According to Microsoft's published benchmarks and community feedback, Copilot-assisted triage typically achieves 85-92% agreement with senior analyst classifications after prompt tuning, which significantly reduces the manual triage burden.

A Note on Licensing and Compute Units
Security Copilot is licensed through Security Compute Units (SCUs), which are provisioned in Azure. Each prompt session consumes SCUs based on the complexity of the request. For automated triage at scale, plan your SCU capacity carefully, because high-volume playbooks can accumulate significant usage. Start with a conservative allocation, monitor consumption through the Security Copilot usage dashboard, and scale up as you validate ROI. Microsoft provides detailed guidance on SCU sizing in the official Security Copilot documentation.
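Before wiring the Step 4 feedback query into a workbook, you can prototype its scoring logic offline against a small export of closed incidents. Here is a minimal sketch. It assumes each incident is a dict with IncidentNumber, TimeGenerated, Classification, and a list of comment strings. It buckets weeks from a chosen start date, whereas KQL's `bin()` may align weeks differently. `weekly_accuracy` and the sample incidents are hypothetical.

```python
import re
from collections import defaultdict
from datetime import datetime, timedelta

# Same pattern the KQL extract() uses; the playbook must write comments this way
ASSESSMENT_RE = re.compile(r"Assessment: (True Positive|False Positive|Benign Positive)")

# Map the comment wording to Sentinel's Classification values
TO_CLASSIFICATION = {
    "True Positive": "TruePositive",
    "False Positive": "FalsePositive",
    "Benign Positive": "BenignPositive",
}


def weekly_accuracy(incidents, start):
    """Percent of classified incidents per 7-day window (counted from `start`)
    whose Copilot assessment matches the analyst's final classification."""
    total = defaultdict(set)    # week -> incidents with a parsable assessment
    correct = defaultdict(set)  # week -> incidents where assessment == classification
    for inc in incidents:
        if not inc["Classification"]:
            continue  # not closed by an analyst yet
        days = (inc["TimeGenerated"] - start).days
        week = start + timedelta(days=7 * (days // 7))
        for comment in inc["Comments"]:
            match = ASSESSMENT_RE.search(comment)
            if not match:
                continue
            total[week].add(inc["IncidentNumber"])
            if TO_CLASSIFICATION[match.group(1)] == inc["Classification"]:
                correct[week].add(inc["IncidentNumber"])
    return {week: round(len(correct[week]) * 100.0 / len(total[week]), 1)
            for week in sorted(total)}


# Illustrative, made-up incidents
start = datetime(2024, 1, 1)
incidents = [
    {"IncidentNumber": 1, "TimeGenerated": datetime(2024, 1, 2), "Classification": "FalsePositive",
     "Comments": ["Assessment: False Positive. Sign-in came from a known VPN egress point."]},
    {"IncidentNumber": 2, "TimeGenerated": datetime(2024, 1, 3), "Classification": "TruePositive",
     "Comments": ["Assessment: Benign Positive"]},
    {"IncidentNumber": 3, "TimeGenerated": datetime(2024, 1, 9), "Classification": "BenignPositive",
     "Comments": ["Assessment: Benign Positive"]},
    {"IncidentNumber": 4, "TimeGenerated": datetime(2024, 1, 9), "Classification": "",
     "Comments": ["Assessment: True Positive"]},
]
for week, accuracy in weekly_accuracy(incidents, start).items():
    print(f"{week:%Y-%m-%d}: {accuracy}%")  # 2024-01-01: 50.0% then 2024-01-08: 100.0%
```

Incident 4 is skipped because no analyst has classified it yet. This mirrors the query's `isnotempty(Classification)` filter.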
Example Scenario - Impossible Travel at Scale
Consider a typical enterprise that generates over 200 impossible travel alerts per week. The SOC team spends roughly 15 hours weekly just triaging these. Here is how automated triage addresses this:
- Detection: Sentinel's built-in impossible travel analytics rule flags the incidents
- Enrichment: The playbook pulls each user's typical travel patterns from sign-in logs over the past 90 days, VPN usage, and whether the "impossible" location matches any known corporate office or VPN egress point
- Copilot Analysis: Security Copilot receives the enriched context and classifies each incident
- Expected Result: Based on common deployment patterns, around 70-75% of impossible travel incidents are auto-closed as benign (VPN, known travel patterns), roughly 20% are downgraded to informational with a triage note, and only about 5% are escalated to analysts as genuine suspicious activity

This type of automation can reclaim over 10 hours per week, time that analysts can redirect to proactive threat hunting.

Getting Started - Practical Recommendations
For teams ready to implement automated triage with Security Copilot and Sentinel, here is a recommended approach:
1. Start small. Pick one high-volume, high-false-positive incident type. Do not try to automate everything at once.
2. Run in shadow mode first. Have the playbook add triage comments but do not auto-close or re-route. Let analysts compare Copilot's assessment with their own for two to four weeks.
3. Tune your prompts. Generic prompts produce generic results. Include organization-specific context such as naming conventions, known infrastructure, and typical user behavior patterns.
4. Monitor accuracy continuously. Use the feedback loop KQL above. If accuracy drops below 80%, pause automation and investigate.
5. Maintain human oversight. Even at 90%+ accuracy, keep a human review step for high-severity incidents. Automation handles volume; analysts handle judgment.
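The shadow-mode, accuracy-threshold, and human-oversight recommendations can be combined into one small routing policy. Below is a minimal sketch. It assumes the playbook can call custom code, for example an Azure Function. It also assumes the playbook already knows Copilot's assessment, the incident severity, and a rolling accuracy figure such as the one the feedback-loop query produces. The function name, action labels, and default threshold are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TriageDecision:
    action: str  # "comment_only", "auto_close", "escalate" or "human_review"
    reason: str


def route(assessment, severity, shadow_mode, rolling_accuracy, pause_below=80.0):
    """Hypothetical routing policy for a Copilot-triaged incident.

    assessment: "True Positive", "Benign Positive" or "False Positive"
    severity: Sentinel severity ("High", "Medium", "Low", "Informational")
    rolling_accuracy: recent agreement with analyst classifications, in percent
    """
    # Shadow mode: only add the triage comment, never close or re-route
    if shadow_mode:
        return TriageDecision("comment_only", "shadow mode: analysts compare assessments")
    # Accuracy has drifted below the threshold: pause automation and investigate
    if rolling_accuracy < pause_below:
        return TriageDecision("comment_only",
                              f"accuracy {rolling_accuracy}% < {pause_below}%, automation paused")
    # High-severity incidents always keep a human review step
    if severity == "High":
        return TriageDecision("human_review", "high severity requires analyst review")
    if assessment in ("False Positive", "Benign Positive"):
        return TriageDecision("auto_close", f"Copilot assessed {assessment}")
    if assessment == "True Positive":
        return TriageDecision("escalate", "Copilot assessed True Positive")
    # Anything unexpected goes to a person rather than being guessed at
    return TriageDecision("human_review", f"unrecognized assessment: {assessment!r}")


print(route("Benign Positive", "Medium", shadow_mode=True, rolling_accuracy=91.0))
print(route("Benign Positive", "Medium", shadow_mode=False, rolling_accuracy=91.0))
print(route("False Positive", "High", shadow_mode=False, rolling_accuracy=91.0))
```

The checks are ordered so that the safety gates always take precedence over Copilot's verdict. Shadow mode and the accuracy pause come first, then high severity.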
The combination of Security Copilot and Microsoft Sentinel represents a genuine step forward for SOC efficiency. Copilot automates the initial triage pass by summarizing incidents, enriching entities, and providing classification recommendations. That frees analysts to focus on what humans do best: making nuanced security decisions under uncertainty.

Feel free to like and/or connect :)