Empower Smarter AI Agent Investments
This curated series of modules is designed to equip technical and business decision-makers (IT professionals, developers, engineers, AI engineers, administrators, solution architects, business analysts, and technology managers) with the practical knowledge and guidance needed to make cost-conscious decisions at every stage of the AI agent journey. From identifying high-impact use cases and understanding cost drivers, to forecasting ROI, adopting best practices, designing scalable and effective architectures, and optimizing ongoing investments, this learning path provides actionable guidance for building, deploying, and managing AI agents on Azure with confidence. Whether you're just starting your AI journey or looking to scale enterprise adoption, these modules will help you align innovation with financial discipline, ensuring your AI agent initiatives deliver sustainable value and long-term success.

Discover the full learning path here: aka.ms/Cost-Efficient-AI-Agents

Explore the sections below for an overview of each module in this learning path, highlighting the core concepts, practical strategies, and actionable insights designed to help you maximize the value of AI agent investments on Azure.

Module 1: Identify and Prioritize High-Impact, Cost-Effective AI Agent Use Cases

The journey begins with a strategic approach to selecting AI agent use cases that maximize business impact and cost efficiency. This module introduces a structured framework for researching proven use cases, collaborating across teams, and defining KPIs to evaluate feasibility and ROI. You'll learn how to target "quick wins" while ensuring alignment with organizational goals and resource constraints. Explore this module

Module 2: Understand the Key Cost Drivers of AI Agents

Building on the foundation of use case selection, Module 2 dives into the core cost drivers of AI agent development and operations on Azure.
It covers infrastructure, integration, data quality, team expertise, and ongoing operational expenses, offering actionable strategies to optimize spending at every stage. The module emphasizes right-sizing resources, efficient data preparation, and leveraging Microsoft tools to streamline development and ensure sustainable, scalable success. Explore this module

Module 3: Forecast the Return on Investment (ROI) of AI Agents

With a clear understanding of costs, the next step is to quantify value. Module 3 empowers both business and technical leaders with practical frameworks for forecasting and communicating ROI, even without a finance background. Through step-by-step guides and real-world examples, you'll learn to measure tangible and intangible outcomes, apply NPV calculations, and use sensitivity analysis to prioritize AI investments that align with broader organizational objectives. Explore this module

Module 4: Implement Best Practices to Empower AI Agent Efficiency and Ensure Long-Term Success

To drive efficiency and governance at scale, Module 4 introduces essential frameworks such as the AI Center of Excellence (CoE), FinOps, GenAI Ops, the Cloud Adoption Framework (CAF), and the Well-Architected Framework (WAF). These best practices help organizations accelerate adoption, optimize resources, and foster operational excellence, ensuring AI agents deliver measurable value, remain secure, and support sustainable enterprise growth. Explore this module

Module 5: Maximize Cost Efficiency by Choosing the Right AI Agent Development Approach

Selecting the right development approach is critical for balancing speed, customization, and cost. In Module 5, you'll learn how to align business needs and technical skills with SaaS, PaaS, or IaaS options, empowering both business users and developers to efficiently build, deploy, and manage AI agents.
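To make the NPV-based forecasting that Module 3 describes concrete, here is a minimal Python sketch. All cash-flow figures and the discount rates are invented assumptions for illustration, not Azure pricing or guidance from the module itself.

```python
# Illustrative sketch of the NPV and sensitivity-analysis approach covered in
# Module 3. All figures below are made-up assumptions, not real project data.

def npv(rate, cashflows):
    """Net present value of yearly cash flows.
    cashflows[0] occurs today (year 0) and is typically negative (the investment)."""
    return sum(cf / (1 + rate) ** year for year, cf in enumerate(cashflows))

# Hypothetical AI agent project: $120k upfront, $60k net benefit per year for 3 years.
cashflows = [-120_000, 60_000, 60_000, 60_000]

base = npv(0.08, cashflows)  # base case: 8% discount rate
print(f"Base-case NPV: ${base:,.0f}")

# Simple sensitivity analysis: vary the discount rate and the benefit estimate.
for rate in (0.05, 0.08, 0.12):
    for benefit_factor in (0.8, 1.0, 1.2):  # pessimistic / expected / optimistic
        flows = [cashflows[0]] + [cf * benefit_factor for cf in cashflows[1:]]
        print(f"rate={rate:.0%}, benefits x{benefit_factor}: NPV=${npv(rate, flows):,.0f}")
```

A positive NPV under the pessimistic scenario is the kind of signal that helps prioritize one candidate investment over another.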
The module also highlights how Microsoft Copilot Studio, Visual Studio, and Azure AI Foundry can help your organization achieve its goals. Explore this module

Module 6: Architect Scalable and Cost-Efficient AI Agent Solutions on Azure

As your AI initiatives grow, architectural choices become paramount. Module 6 explores how to leverage Azure Landing Zones and reference architectures for secure, well-governed, and cost-optimized deployments. It compares single-agent and multi-agent systems, highlights strategies for cost-aware model selection, and details best practices for governance, tagging, and pricing, ensuring your AI solutions remain flexible, resilient, and financially sustainable. Explore this module

Module 7: Manage and Optimize AI Agent Investments on Azure

The learning path concludes with a focus on operational excellence. Module 7 provides guidance on monitoring agent performance and spending using Azure AI Foundry Observability, Azure Monitor Application Insights, and Microsoft Cost Management. Learn how to track key metrics, set budgets, receive real-time alerts, and optimize resource allocation, empowering your organization to maximize ROI, stay within budget, and deliver ongoing business value. Explore this module

Ready to accelerate your AI agent journey with financial confidence? Start exploring the new learning path and unlock proven strategies to maximize the cost efficiency of your AI agents on Azure, transforming innovation into measurable, sustainable business success. Get started today

Cloud and AI Cost Efficiency: A Strategic Imperative for Long-Term Business Growth
In this blog, we'll explore why cost efficiency is a top priority for organizations today, look at how Azure Essentials can help address this challenge, and provide an overview of Microsoft's solutions, tools, programs, and resources designed to help organizations maximize the value of their cloud and AI investments.

A New Platform Management Group & Subscription for Security in Azure landing zone (ALZ)
At the start of 2025, during the January 2025 ALZ Community Call, we asked everyone for feedback, via discussions 1898 & 1978 on our GitHub repo, on the future of Microsoft Sentinel in the Azure landing zone (ALZ) architecture, as we were hearing that it needed some changes and additional clarity compared with what ALZ was deploying and advising at the time. We have since worked with customers, partners, and internal teams to figure out what we should update in ALZ around Microsoft Sentinel and security tooling, and have updated the ALZ conceptual architecture to show this.

What did ALZ advise and deploy before, by default?

Prior to these updates, ALZ advised the following:

- The central Log Analytics Workspace (LAW) in the Management Subscription should be used to capture all logs, including security/SIEM logs
- The Microsoft Sentinel solution (called Security) should also be installed on this LAW

And in the accelerators and tooling it deployed, by default:

- The central Log Analytics Workspace (LAW) in the Management Subscription with the Microsoft Sentinel solution installed
- Microsoft Sentinel had no additional configuration apart from being installed as a solution on the central LAW

What are the changes being made to ALZ from today?

Based on the feedback from the GitHub discussions and from working with customers, partners, and internal teams, we are making the following changes:

- A new dedicated Security Management Group beneath the Platform Management Group
- A new dedicated Security Subscription placed in the new Security Management Group
- Nothing will be deployed into this subscription by ALZ by default.
This allows:

- Customers and partners to deploy and manage the Microsoft Sentinel deployment however they wish
- The 31-day 10 GB/day free trial to be started when the customer or partner is ready to utilise it

We will also no longer deploy the Microsoft Sentinel solution (called Security) on the central LAW in the Management Subscription. This allows for the separation of operational/platform logs from security logs, as per the considerations documented in Design a Log Analytics workspace architecture.

The changes have only been made to our ALZ CAF/MS Learn guidance as of now; the changes to the accelerators and implementation tools will be made over the coming months 👍 These changes can be seen in the latest ALZ conceptual architecture snippet below. The full ALZ conceptual architecture can be seen here on MS Learn. You can also download a Visio or PDF copy of all the ALZ diagrams.

What if we have already deployed ALZ?

If you have already deployed ALZ and haven't tailored the ALZ default Management Group hierarchy to create a Security Management Group, you can now review and decide whether this is something you'd like to create and align with. While not mandatory, this enhancement to the ALZ architecture is recommended for new customers. The previous approach remains valid; however, feedback from customers, partners, and internal teams indicates that using a dedicated Microsoft Sentinel and Log Analytics Workspace within a separate security-focused Subscription and Management Group is a common real-world practice. To reflect these real-world implementations and feedback, we're evolving the ALZ conceptual architecture accordingly 👍

Closing

We hope you benefit from this update to the ALZ conceptual architecture. As always, we welcome any feedback via our GitHub Issues.

GA: Enhanced Audit in Azure Security Baseline for Linux
We're thrilled to announce the General Availability (GA) of the enhanced Azure Security Baseline for Linux, a major milestone in cloud-native security and compliance. This release brings powerful, audit-only capabilities to over 1.6 million Linux devices across all Azure regions, helping enterprise customers and IT administrators monitor and maintain secure configurations at scale.

What Is the Azure Security Baseline for Linux?

The Azure Security Baseline for Linux is a set of pre-configured security recommendations delivered through Azure Policy and Azure Machine Configuration. It enables organizations to continuously audit Linux virtual machines and Arc-enabled servers against industry-standard benchmarks, without enforcing changes or triggering auto-remediation. This GA release focuses on enhanced audit capabilities, giving teams deep visibility into configuration drift and compliance gaps across their Linux estate. For the remediation experience, a limited public preview is available here: What is the Azure security baseline for Linux? | Microsoft Learn

Why Enhanced Audit Matters

In today's hybrid environments, maintaining compliance across diverse Linux distributions is a challenge. The enhanced audit mode provides:

- Granular insights into each configuration check
- Industry-aligned benchmarks for a standardized security posture
- Detailed rule-level reporting with evidence and context
- Scalable deployment across Azure and Arc-enabled machines

Whether you're preparing for an audit, hardening your infrastructure, or simply tracking configuration drift, enhanced audit gives you the clarity and control you need, without enforcing changes.

Key Features at GA

✅ Broad Linux Distribution Support

📘 Full distro list: Supported Client Types

🔍 Industry-Aligned Audit Checks

The baseline audits more than 200 security controls per machine, aligned to industry benchmarks such as CIS.
These checks cover:

- OS hardening
- Network and firewall configuration
- SSH and remote access settings
- Logging and auditing
- Kernel parameters and system services

Each finding includes a description and the actual configuration state, making it easy to understand and act on.

🌐 Hybrid Cloud Coverage

The baseline works across:

- Azure virtual machines
- Arc-enabled servers (on-premises or in other clouds)

This means you can apply a consistent compliance standard across your entire Linux estate, whether it's in Azure, on-premises, or multi-cloud.

🧠 Powered by Azure OSConfig

The audit engine is built on the open-source Azure OSConfig framework, which performs Linux-native checks with minimal performance impact. OSConfig is modular, transparent, and optimized for scale, giving you confidence in the accuracy of audit results.

📊 Enterprise-Scale Reporting

Audit results are surfaced in:

- The Azure Policy compliance dashboard
- Azure Resource Graph Explorer
- Microsoft Defender for Cloud (Recommendations view)

You can query, export, and visualize compliance data across thousands of machines, making it easy to track progress and share insights with stakeholders.

💰 Cost

There's no premium SKU or license required to use the audit capabilities; charges apply only to Azure Arc-managed workloads hosted on-premises or in other CSP environments, making it easy to adopt across your environment.

How to Get Started

1. Review the quickstart guide: 📘 Quickstart: Audit Azure Security Baseline for Linux
2. Assign the built-in policy: search for "Linux machines should meet requirements for the Azure compute security baseline" in Azure Policy and assign it to your desired scope.
3. Monitor compliance: use Azure Policy and Resource Graph to track audit results and identify non-compliant machines.
4. Plan remediation: while this release does not include auto-remediation, the detailed audit findings make it easy to plan manual or scripted fixes.
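As a rough illustration of how rule-level findings roll up into the per-machine compliance states these views present, here is a small self-contained Python sketch. The machine names, rule names, and results are invented examples, not actual baseline output.

```python
# Illustrative roll-up of rule-level audit findings into per-machine compliance,
# mirroring the kind of summary a compliance dashboard presents.
# Machines, rules, and results below are invented, not real baseline data.

from collections import defaultdict

findings = [
    # (machine, rule, compliant?)
    ("web-vm-01", "Ensure SSH root login is disabled", True),
    ("web-vm-01", "Ensure firewall is active", False),
    ("db-vm-02",  "Ensure SSH root login is disabled", True),
    ("db-vm-02",  "Ensure firewall is active", True),
]

def summarize(findings):
    """Group findings by machine; a machine is compliant only if every rule passes."""
    by_machine = defaultdict(list)
    for machine, rule, ok in findings:
        by_machine[machine].append((rule, ok))
    return {
        machine: {
            "compliant": all(ok for _, ok in rules),
            "failed_rules": [rule for rule, ok in rules if not ok],
        }
        for machine, rules in by_machine.items()
    }

summary = summarize(findings)
for machine, result in sorted(summary.items()):
    state = "Compliant" if result["compliant"] else "Non-compliant"
    print(f"{machine}: {state} {result['failed_rules']}")
```

The same "all rules must pass" roll-up is why a single drifted setting is enough to flag a machine as non-compliant in the dashboard.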
Final Thoughts

This GA release marks a major step forward in securing Linux workloads at scale. With enhanced audit now available, enterprise teams can:

- Improve visibility into Linux security posture
- Align with industry benchmarks
- Streamline compliance reporting
- Reduce risk across cloud and hybrid environments

Designing for Certainty: How Azure Capacity Reservations Safeguard Mission‑Critical Workloads
Why capacity reservations matter now

Cloud isn't running out of metal, but demand is compounding and often spikes. Resource strain shows up in specific regions, zones, and VM SKUs, especially for popular CPU families, memory-optimized sizes, and anything involving GPUs. Seasonal events (retail peaks), regulatory cutovers, emergency response, and bursty AI pipelines can trigger sudden surges. Even with healthy regional capacity, a single zone or a specific SKU can be tight. Capacity reservations acknowledge this reality and make it designable instead of probabilistic.

- Root reality: Capacity is finite at the SKU-in-zone granularity, and demand arrives in waves.
- Risk profile: The risk is not "no capacity in the cloud," but "no capacity for this exact size in this exact place at this exact moment."
- Strategic move: Reserve what matters, where it matters, before you need it.

What capacity means in practice

Think of three dimensions: region, zone, and SKU. Your workload's SLO ties to all three.

- Region: The biggest pool of resources. It gives you flexibility but doesn't guarantee availability in a specific zone.
- Zone: This is where fault isolation happens, and where you'll often feel the pinch first when demand spikes.
- SKU: The specific type of machine you're asking for. This is usually the tightest constraint, especially for popular sizes like Dv5, Ev5, or anything with GPUs.

Azure Capacity Reservations let you lock capacity for a specific VM size at the regional or zonal scope and then place VMs/scale sets into that reservation.
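The "SKU-in-zone" framing can be made concrete with a toy placement model: a reservation holds N instances of one SKU in one zone, and a request succeeds only when the matching pool has headroom. This is purely illustrative Python (real placement is handled by Azure, and the region, zone, and SKU values are made up).

```python
# Toy model of the region/zone/SKU framing: a reservation holds N instances of
# one SKU in one zone; a placement request succeeds only if the matching pool
# has room. Purely illustrative; real placement is handled by Azure.

from dataclasses import dataclass

@dataclass
class Reservation:
    region: str
    zone: str       # e.g. "1", "2", "3"
    sku: str        # e.g. "Standard_D8s_v5"
    reserved: int   # instances held for you
    used: int = 0

    def try_place(self, region, zone, sku, count):
        """Place `count` VMs if this reservation matches and has headroom."""
        matches = (region, zone, sku) == (self.region, self.zone, self.sku)
        if matches and self.used + count <= self.reserved:
            self.used += count
            return True
        return False

# Hypothetical surge plan: hold 10 x Standard_D8s_v5 in zone 1 of eastus.
r = Reservation("eastus", "1", "Standard_D8s_v5", reserved=10)

print(r.try_place("eastus", "1", "Standard_D8s_v5", 8))  # fits within 10: True
print(r.try_place("eastus", "1", "Standard_D8s_v5", 4))  # would exceed 10: False
print(r.try_place("eastus", "2", "Standard_D8s_v5", 1))  # wrong zone: False
```

The third call failing despite free regional capacity is exactly the "right size, wrong place" risk the article describes: a zonal reservation only guarantees its own zone.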
Pay‑as‑you‑go vs capacity reservations vs reserved instances

| Attribute | Pay‑as‑you‑go | Capacity Reservations | Reserved Instances |
| --- | --- | --- | --- |
| Primary purpose | Flexibility, no commitment | Guarantee availability for a VM size | Reduce price for steady usage |
| What it guarantees | Nothing beyond current availability | Capacity in region/zone for N of a SKU | Discount on matching usage (1‑ or 3‑year term) |
| Scope | Region/zone at runtime, best‑effort | Bound to region or specific zone | Billing benefit across scope rules |
| Commitment | None | Active while you keep it (on‑demand) | Term commitment (1 or 3 years) |

Key clarifications

- Capacity reservations ≠ discount tool: They exist to secure availability. You pay while the reservation is active (even if idle) because Azure is holding that capacity for you.
- Reserved Instances ≠ capacity guarantee: They reduce the rate you pay when you run matching VMs, but they don't hold hardware for you.
- Together: Use Capacity Reservations to ensure the VMs can run; use Reserved Instances to lower the cost of the runtime those VMs consume.

This is universal, not just Azure

Every major cloud faces the same physics: finite hardware, localized spikes, SKU-specific constraints, and growth in high-demand families (especially GPUs). AWS offers On‑Demand Capacity Reservations; Google Cloud offers zonal reservations. The names differ; the pattern and the need are the same. If your architecture depends on "must run here, at this size, and right now," you either design for capacity or accept availability risk.

When mission‑critical means "reserve it"

If failure to acquire capacity breaks your SLO, treat capacity as a dependency to engineer, not a variable to assume.

- High-stakes cutovers and events. Examples: Black Friday, tax deadlines, trading close, clinical batch windows. Action: Pre‑reserve the exact SKU in the exact zones for the surge window.
- HA across zones. Goal: Survive a zone failure by scaling in active zones.
  Action: Consider keeping extra capacity in each zone based on your failover plan, whether that's N+1 or matching peak load, depending on active/active vs. active/passive.
- Change windows that deallocate/recreate. Risk: If a VM is deallocated during maintenance, it might not get the same placement when restarted. Action: Associate VMs/VMSS with a capacity reservation group before deallocation.
- Fixed‑SKU dependencies. Signal: Performance needs, licensing rules, or hardware accelerators that lock you into a specific VM family. Action: Reserve by SKU. If possible, define fallback SKUs and split reservations across them.
- Regulated or latency‑sensitive workloads. Constraint: Must run in a specific zone or region due to compliance or latency. Action: Prefer zonal reservations to control both locality and availability.

How Reserved Instances complement capacity reservations

Two-layer strategy:

- Layer 1, availability: Capacity reservations ensure your compute can be placed when needed.
- Layer 2, economics: Reserved Instances (or Savings Plans) apply a pricing benefit to the steady‑state hours you actually run.

Practical pairing:

- Steady base load: Cover with 1‑ or 3‑year Reserved Instances for maximum savings.
- Critical surge headroom: Hold with Capacity Reservations; if the surge is predictable, you can still layer partial RI coverage aligned to expected utilization.
- Dynamic burst: Leave as pay‑as‑you‑go, or use short‑lived reservations during known windows.

FinOps hygiene:

- Coverage ratios: Track RI coverage and capacity reservation utilization separately.
- Rightsizing: Align reservations to the SKU mix you truly run; shift or cancel idle capacity reservations quickly.
- Chargeback: Attribute the cost of "insurance" (capacity) to the workloads that require the SLO, separate from the cost of "fuel" (compute hours).

Conclusion

In today's cloud landscape, resilience isn't just redundancy; it's about assured access to the exact resources your workload demands.
Capacity Reservations remove uncertainty by guaranteeing placement, while Reserved Instances drive cost efficiency for predictable use. Together, they form a strategic duo that keeps mission‑critical services running smoothly under any demand surge. Build with both in mind, and you turn capacity from a risk into a controlled asset.

Announcing Public Preview for Azure Service Groups!
What are Service Groups?

Service Groups are a new resource container enabling management and observability scenarios where flexibility in hierarchy and membership is needed. Service Groups are tenant-level resources, so they can have members from across the tenant, but they do not interfere with or use tenant-wide RBAC or Policy capabilities.

Key Features

- Low-privilege management: Service Groups are designed to operate with minimal permissions, ensuring that users can manage resources without needing excessive access, which appeals to multiple personas. Access to a Service Group does not grant role-based access control or policy inheritance to its members.
- Flexible and varying hierarchies: Azure resources and scopes, from anywhere in the tenant, can become members of one or multiple Service Groups. Additionally, Service Groups can be nested, providing the ability to have multiple hierarchy structures, e.g. cost center, product, organization, and more!
- Monitoring capabilities: From application to infrastructure health, Azure Monitor features (such as Health Models) are now available to help you troubleshoot, investigate, and monitor your Service Group.

When should I use them?

Service Groups should be leveraged in scenarios where resources sprawl across existing containers, making them difficult to monitor and manage. This is commonly found in scenarios that need to model application hierarchies, company services, and workloads. Service Groups cannot be used as a deployment scope, nor to manage Policy or RBAC.

Try it out!

Quickly get started with Service Groups using the REST API or the Azure Portal! For more information on Service Groups, please visit aka.ms/servicegroups.

FAQ

Do Service Groups replace existing Azure groups? No, Service Groups have been designed to work in parallel with existing Azure groups. For a comparison of existing scopes, please review the scenario comparison documentation.

Who can create Service Groups?
Anyone with a valid Azure user account in a Microsoft Entra directory can leverage Service Groups!

Why are Service Groups tenant-level? Service Groups are tenant-level so they can have membership from across the tenant. However, unlike pre-existing tenant-level resources (e.g., Management Groups), Service Groups do not grant users tenant-wide access.

Share Your Feedback

You can reach our team by email at azureservicegroups@microsoft.com.

Upgrading your server and client TLS protocol just got easier using Automanage Machine Configuration
Ensuring secure communication protocols across server environments has been a clear requirement for IT admins, operators, and developers for the past two decades. What wasn't clear was how to set a desired communication protocol and maintain it at scale, until now.
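To illustrate the kind of policy this article is about, requiring a minimum TLS protocol version, here is a small sketch using Python's standard-library ssl module. This shows the concept at the application level only; it is not how Automanage Machine Configuration implements the setting (that happens at the OS level, fleet-wide).

```python
# Illustration of enforcing a minimum TLS protocol version on the client side
# with Python's standard-library ssl module. Conceptual sketch only; Automanage
# Machine Configuration applies the equivalent policy at the OS level, not per app.

import ssl

def make_client_context(min_version=ssl.TLSVersion.TLSv1_2):
    """Build a client TLS context that refuses anything older than min_version."""
    ctx = ssl.create_default_context()  # sane defaults: cert and hostname checks on
    ctx.minimum_version = min_version   # reject TLS 1.0 / 1.1 handshakes outright
    return ctx

ctx = make_client_context()
print(ctx.minimum_version)                     # minimum protocol now pinned
print(ctx.verify_mode == ssl.CERT_REQUIRED)    # certificate validation stays on
```

Pinning the minimum version in one place, rather than per connection, is the same idea the machine-configuration approach scales up: declare the desired protocol once and have every endpoint comply.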