security
Ubuntu Pro FIPS 22.04 LTS on Azure: Secure, compliant, and optimized for regulated industries
Organizations across government (including local and federal agencies and their contractors), finance, healthcare, and other regulated industries running workloads on Microsoft Azure now have a streamlined path to meet rigorous FIPS 140-3 compliance requirements. Canonical is pleased to announce the availability of Ubuntu Pro FIPS 22.04 LTS on the Azure Marketplace, featuring newly certified cryptographic modules.

This offering extends the stability and comprehensive security features of Ubuntu Pro, tailored for state agencies, federal contractors, and industries requiring a FIPS-validated foundation on Azure. It provides the enterprise-grade Ubuntu experience, optimized for performance on Azure in collaboration with Microsoft, and enhanced with critical compliance capabilities.

For instance, if you are building a Software as a Service (SaaS) application on Azure that requires FedRAMP authorization, utilizing Ubuntu Pro FIPS 22.04 LTS can help you meet specific controls like SC-13 (Cryptographic Protection), as FIPS 140-3 validated modules are a foundational requirement. This significantly streamlines your path to achieving FedRAMP compliance.

What is FIPS 140-3 and why does it matter?

FIPS 140-3 is the latest iteration of the benchmark U.S. government standard for validating cryptographic module implementations, superseding FIPS 140-2. Managed by NIST, it's essential for federal agencies and contractors and is a recognized best practice in many regulated industries like finance and healthcare. Using FIPS-validated components helps ensure cryptography is implemented correctly, protecting sensitive data in transit and at rest.

Ubuntu Pro FIPS 22.04 LTS includes FIPS 140-3 certified versions of the Linux kernel and key cryptographic libraries (like OpenSSL, Libgcrypt, GnuTLS) pre-enabled, which are drop-in replacements for the standard packages, greatly simplifying deployment for compliance needs.
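One quick way to confirm that a VM really booted with the FIPS kernel mode switched on is to read the kernel's FIPS flag. The sketch below assumes the standard Linux procfs location `/proc/sys/crypto/fips_enabled`, which holds `1` when the kernel is in FIPS mode. The function name `fips_mode_enabled` is illustrative and not part of any Ubuntu Pro tooling.

```python
from pathlib import Path

# Standard procfs location of the kernel FIPS flag on Linux.
FIPS_FLAG = Path("/proc/sys/crypto/fips_enabled")


def fips_mode_enabled(flag_path: Path = FIPS_FLAG) -> bool:
    """Return True if the flag file exists and contains '1'."""
    try:
        return flag_path.read_text().strip() == "1"
    except OSError:
        # Flag file missing or unreadable: treat as "not in FIPS mode".
        return False


if __name__ == "__main__":
    print("FIPS mode enabled:", fips_mode_enabled())
```

A check like this only reports the kernel flag. It does not prove that every library in use is one of the validated modules, so it complements rather than replaces a compliance review.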
The importance of security updates (fips-updates)

A FIPS certificate applies to a specific module version at its validation time. Over time, new vulnerabilities (CVEs) are discovered in these certified modules. Running code with known vulnerabilities poses a significant security risk. This creates a tension between strict certification adherence and maintaining real-world security.

Recognizing this, Canonical provides security fixes for the FIPS modules via the fips-updates stream, available through Ubuntu Pro. We ensure these security patches do not alter the validated cryptographic functions. This approach aligns with modern security thinking, including recent FedRAMP guidance, which acknowledges that unpatched vulnerabilities pose a greater risk than departing from the original certified binaries. Canonical strongly recommends all users enable the fips-updates repository to ensure their systems are both compliant and secure against the latest threats.

FIPS 140-3 vs 140-2

The new FIPS 140-3 standard allows modern protocols such as TLS 1.3 and deprecates older algorithms like MD5. If you are upgrading systems and workloads to FIPS 140-3, you will need to test rigorously to ensure that applications continue to work correctly.

Compliance tooling included

Ubuntu Pro FIPS also includes access to Canonical's Ubuntu Security Guide (USG) tooling, which assists with automated hardening and compliance checks against benchmarks like CIS and DISA-STIG, a key requirement for FedRAMP deployments.

How to get Ubuntu Pro FIPS on Azure

You can leverage Ubuntu Pro FIPS 22.04 LTS on Azure in two main ways:

- Deploy the Marketplace Image: Launch a new VM directly from the dedicated Ubuntu Pro FIPS 22.04 LTS listing on the Azure Marketplace. This image comes with the FIPS modules pre-enabled for immediate use.
- Enable on an Existing Ubuntu Pro VM: If you already have an Ubuntu Pro 22.04 LTS VM running on Azure, you can enable the FIPS modules using the Ubuntu Pro Client (pro enable fips-updates).

Upgrading standard Ubuntu: If you have a standard Ubuntu 22.04 LTS VM on Azure, you first need to attach Ubuntu Pro to it. This is a straightforward process detailed in the Azure documentation for getting Ubuntu Pro. Once Pro is attached, you can enable FIPS as described above.

Learn More

Ubuntu Pro FIPS provides a robust, maintained, and compliant foundation for your sensitive workloads on Azure.

- Watch Joel Sisko from Microsoft speak with Ubuntu experts in this webinar
- Explore all features of Ubuntu Pro on Azure
- Read details on the FIPS 140-3 certification for Ubuntu 22.04 LTS
- Official NIST certification link

Introducing Azure Container Linux (ACL)
Today at Microsoft Build 2026, we’re announcing the general availability of Azure Container Linux (ACL): a secure, immutable container host designed to help platform teams run Kubernetes workloads at scale on Azure Kubernetes Service (AKS) with greater consistency, reduced operational overhead, and a stronger default security posture. This release builds on Microsoft’s long-standing commitment to the Flatcar Container Linux ecosystem as a foundation for secure, minimal, and container-optimized operating systems. This commitment includes the acquisition of Kinvolk in 2021, bringing deep expertise in Flatcar development and cloud-native systems into Azure, and the subsequent donation of Flatcar to the Cloud Native Computing Foundation (CNCF), ensuring its continued growth as a community-driven project. Flatcar has played a critical role in helping customers run cloud-native infrastructure at scale, introducing an immutable, minimal OS model that reduces configuration drift, minimizes attack surface, and simplifies lifecycle management. As customer needs continue to grow, there is an increasing demand for deeper integration with cloud platforms, stronger default security enforcement, and a more tightly managed supply chain experience in managed environments like AKS. Building on this foundation, Azure Container Linux (ACL) represents the next evolution of this approach. ACL is intentionally built downstream of Flatcar to preserve compatibility with its ecosystem and leverage its mature, battle-tested design. ACL integrates Azure Linux binaries as the core foundation, providing consistency and compatibility with other Azure Linux use cases (including Azure Linux VMs), while bringing enterprise-hardened security and supportability into the platform. Looking ahead, ACL will further incorporate optional advanced code integrity capabilities from Azure Linux with OS Guard. 
We remain committed to the Flatcar community and will continue contributing innovations upstream while bringing a fully managed, enterprise-ready product to customers through ACL.

Why a Trusted, Immutable Host Model Matters for AKS

As Kubernetes adoption scales, platform teams face increasing complexity in managing node-level consistency, security, and lifecycle operations across large fleets. Traditional OS models introduce challenges such as:

- Configuration drift across nodes, leading to inconsistent behavior and harder-to-debug issues
- Fragmented update mechanisms that increase operational overhead and risk during upgrades
- Expanding attack surface due to unnecessary packages and mutable system state
- Limited visibility and guarantees around the provenance and integrity of OS components

In managed environments like AKS, these challenges are amplified as teams look to operate clusters reliably at scale while meeting stricter security and compliance requirements.

Azure Container Linux: Built for Consistency and Trust

ACL addresses these challenges with a fully image-based operating system model that eliminates configuration drift, ensuring consistent behavior across nodes. Updates are delivered through AKS node image upgrades, providing a consistent and repeatable way to roll out OS changes across clusters without relying on in-place modifications. By standardizing how nodes are built, updated, and operated, ACL helps ensure clusters remain in a known-good, reproducible state over time, even as they scale. Over time, this model will continue to evolve to support A/B update mechanisms to further improve reliability, speed, and operational efficiency.

Secure from the Start, and Designed for the Future

ACL is engineered with a hardened security posture from the moment it boots. Its immutable design protects the integrity of the operating system, prevents unauthorized changes, and ensures consistent, reproducible behavior across your Kubernetes fleet.
By removing unnecessary components and tightly constraining how the system can be modified, ACL reduces the attack surface and provides a strong foundation for running production workloads with confidence.

Under the hood, ACL incorporates several safeguards that reinforce its secure-by-default model:

- Read-only /usr filesystem to prevent tampering with core system components.
- A minimal package set purpose-built for container workloads, reducing CVE exposure.
- Mandatory access control with SELinux, enforcing strict least-privilege policies.
- Trusted Launch using a Unified Kernel Image (UKI) to bundle the kernel, initramfs, and kernel command line into a single signed artifact, ensuring integrity from the earliest stage.
- Signed Azure Linux RPMs delivered through a trusted, end-to-end Microsoft supply chain.

Going forward, we will continue to evolve ACL’s security posture as we bring over additional innovations from Azure Linux with OS Guard. This includes integrating code integrity into the ACL image, using the Integrity Policy Enforcement (IPE) Linux security module, to ensure that only binaries from trusted, signed volumes are allowed to execute. IPE will also extend to container images, ensuring that only binaries matching a trusted signature can be executed from verified dm-verity backed layers.

Where applicable, we are committed to contributing these advancements upstream to the Flatcar project, helping strengthen the ecosystem and ensuring that improvements benefit the broader cloud-native community.

Differentiating between Azure Container Linux and Existing Container Hosts on AKS

AKS now provides multiple generally available Linux OS options, including general-purpose container hosts (Azure Linux and Ubuntu) and an immutable container host (Azure Container Linux). While all options are fully supported by Microsoft, they are designed to address distinct operational and security use cases.
The sections below highlight the key differences to help you choose and position the right OS for your scenario.

| | General Purpose OS | Azure Container Linux |
|---|---|---|
| Filesystem | Writable (read-write) | Immutable (read-only) /usr with dm-verity guarantees |
| Focus on | Extensibility, flexibility, and choice. | Out of the box security and compliance guarantees. |
| Mandatory Access Control | AppArmor (optional) | SELinux (enforcing by default)* |
| Secure Boot | Optional (supported with certain VM sizes) | Supported by default with UKI (Unified Kernel Image) |
| Updates | Package and Image based updates supported | Only image-based updates supported (A/B update support on the roadmap) |

*SELinux policies are subject to change over time based on customer feedback.

Day-1 Ecosystem Partner Support

Azure Container Linux is launching with support from a broad ecosystem of security, monitoring, networking, and data partners. The following partners are expected to offer support or validated integrations at Day-1 availability:

- Dynatrace: application performance monitoring and observability.
- Aquasec: database platform support on ACL.
- Qualys: vulnerability, compliance, and container security.
- Upwind: runtime cloud security and risk prioritization.
- Elastic: logs, metrics, and observability for Kubernetes.
- Isovalent: Kubernetes networking, observability, and security powered by eBPF (Cilium).

If you’re interested in becoming a supported Azure Container Linux partner, please reach out to: AzureLinuxPartners@microsoft.com

What Customers Are Saying

Early customer feedback highlights the real-world impact of Azure Container Linux on improving security posture and operational consistency at scale.

“We’ve found working closely with the Microsoft product team throughout the Azure Container Linux preview to be invaluable. The product's immutability, minimal footprint, and built-in security controls (such as SELinux and Trusted Launch) will strengthen our AKS security posture across every deployment instance in Nationwide.
Furthermore, its focus on secure-by-design foundations is especially timely as we face advanced threat detection capabilities within the industry.”
- Enterprise Container Platform, Cloud - Nationwide

Engineered for AKS from Day One

Azure Container Linux is deeply integrated with AKS to ensure a seamless operational experience. It is compatible with many critical AKS extensions and add-ons, and works smoothly with existing application containers and deployment workflows. ACL is available across AMD64 and Arm64 architectures, ensuring consistent behavior across environments, and includes support for GPU-enabled workloads.

Enabling ACL is as simple as specifying the following in your node pool configuration:

    --os-sku AzureContainerLinux

Whether you're onboarding new clusters or migrating existing ones, ACL is designed to integrate into your environment with minimal friction.

A Clear Path Forward for AKS Preview Users

With the release of Azure Container Linux, AKS will transition to offer one unified immutable host offering. This work started with our use of Flatcar Container Linux in Preview and now continues with the GA release of ACL. As part of this release, Flatcar will no longer be available via --os-sku on AKS. Please note, this change applies specifically to the AKS preview experience; Flatcar is not being retired.

Later this year we will complete the convergence of our immutable OS offerings by incorporating remaining kernel and runtime features of the current OS Guard preview into ACL. At that time, existing users of OS Guard will receive a guided transition to ACL, ensuring operational continuity while consolidating to a single container host.

Get Started with Azure Container Linux

ACL is GA and available today for all AKS customers.
To begin using ACL in your clusters and explore documentation, best practices, and deployment guidance, visit: aka.ms/azurecontainerlinux

ACL represents the future of secure, cloud-optimized Linux on AKS: building on the proven foundation of Flatcar, advancing it with Azure Linux innovations, and contributing back to the open-source ecosystem that customers depend on. We’re thrilled to bring this new foundation to our customers and can’t wait to see what you build with it.

Learn More

- //Build Session: Build, deploy, and run Linux workloads on Azure
- Azure Container Linux documentation: https://aka.ms/azurecontainerlinux
- Azure Container Linux on GitHub: https://github.com/microsoft/azure-container-linux
- Azure Linux product page: https://aka.ms/AzureLinuxProduct
- Azure Linux documentation: https://aka.ms/azurelinux
- Joining the ISV partner program: AzureLinuxPartners@microsoft.com

Applying Site Reliability Engineering to Autonomous AI Agents
If you practice SRE, you already have a mental model for running reliable production systems. You define SLOs. You track error budgets. You use circuit breakers to stop cascading failures. You run chaos experiments to find weaknesses before customers do. You treat every operational decision as a tradeoff between reliability and velocity. That mental model transfers directly to AI agents. It just needs four new ideas.

In the Agent Governance Toolkit: Architecture Deep Dive, Policy Engines, Trust, and SRE for AI Agents, we covered Agent SRE briefly as one of AGT's nine packages: SLOs, error budgets, circuit breakers, chaos engineering, and progressive delivery, adapted from the patterns your SRE team already applies to microservices. Several teams asked for the full story. This is it.

Agent SRE is one of the more novel parts of the toolkit. The policy engine, zero-trust identity, and execution sandboxing have clear analogs in existing security practice. Agent SRE explores newer ground. Established patterns for defining SLOs for AI agent behavior, building chaos experiments for LLM provider failures, or applying error budgets to agent autonomy are still emerging across the industry. We built these capabilities because running agents in production without them is the equivalent of running a fleet of microservices without circuit breakers, health checks, or an on-call runbook.

This post is for SRE teams, platform engineers, and anyone responsible for running AI agents in production. You do not need to be an AI specialist. If you know what a burn rate is, you are ready for this.

The Problem: Agents Fail in Ways Your Existing SRE Tooling Cannot See

When a service fails, your observability stack tells you: latency went up, error rate crossed the SLO threshold, the circuit breaker opened. You page the on-call engineer. They look at traces and find the slow database query.

When an AI agent fails, your observability stack is silent. The agent returned HTTP 200.
Latency was normal. Error rate was zero. But the agent quietly approved a transaction it was not authorized to approve, hallucinated a database path and wrote to the wrong table, or got stuck in a reasoning loop that consumed $800 of LLM API budget before anyone noticed.

These are not infrastructure failures. They are behavioral failures. And they are invisible to monitoring tools built for stateless, deterministic services, because those tools only watch for crashes and timeouts. They do not watch for wrong behavior.

This gap is the problem Agent SRE was designed to solve. The solution borrows everything from the SRE playbook and adds one concept that extends it: the Safety SLI.

The Safety SLI: A New Reliability Dimension

Traditional SLIs measure system behavior from the user's perspective: latency, availability, error rate, throughput. They answer: did the service respond correctly?

For AI agents, correctness is not enough. An agent that responds correctly but acts outside its authorized scope has not succeeded. It has failed in a way that none of your existing SLIs can detect.

The Safety SLI answers a different question: did the agent act within policy?

```python
from agent_sre import SLO, ErrorBudget
from agent_sre.slo.indicators import PolicyCompliance

# Define a safety SLO: 99% of agent actions must comply with policy
safety_slo = SLO(
    name="safety-compliance",
    indicators=[
        PolicyCompliance(
            target=0.99,
            window="7d",
        ),
    ],
    error_budget=ErrorBudget(
        total=0.01,              # 1% budget (1 - 0.99 target)
        window_seconds=2592000,  # 30-day window
        burn_rate_alert=2.0,     # warn at 2x sustainable rate
        burn_rate_critical=5.0,  # page at 5x sustainable rate
    ),
)
```

When an agent's policy compliance rate drops below 99%, the error budget starts burning. The ErrorBudget tracks consumption automatically and exposes burn rate alerts through its firing_alerts() method.
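To make the burn rate thresholds concrete, here is a minimal sketch of the arithmetic behind them. It assumes the common SRE definition (observed error rate divided by the error rate the SLO allows) and a constant burn over the window. The helper names `burn_rate`, `classify`, and `hours_to_exhaustion` are illustrative only and are not part of the agent_sre API, whose internal calculation may differ.

```python
def burn_rate(violations: int, total_actions: int, slo_target: float) -> float:
    """Observed policy-violation rate divided by the rate the SLO allows."""
    if total_actions == 0:
        return 0.0
    allowed_error_rate = 1.0 - slo_target          # e.g. 0.01 for a 99% target
    observed_error_rate = violations / total_actions
    return observed_error_rate / allowed_error_rate


def classify(rate: float, alert_at: float = 2.0, critical_at: float = 5.0) -> str:
    """Map a burn rate onto the warn/page thresholds used in the SLO above."""
    if rate >= critical_at:
        return "critical"
    if rate >= alert_at:
        return "alert"
    return "ok"


def hours_to_exhaustion(rate: float, window_seconds: int) -> float:
    """Hours until a full budget is gone if the burn rate stays constant."""
    if rate <= 0:
        return float("inf")
    return window_seconds / 3600 / rate


if __name__ == "__main__":
    # 30 policy violations in 1,000 actions against a 99% target
    rate = burn_rate(violations=30, total_actions=1000, slo_target=0.99)
    print(f"burn rate: {rate:.2f}x ({classify(rate)})")
    print(f"hours to exhaustion: {hours_to_exhaustion(rate, 2592000):.0f}")
```

At a burn rate of 1.0 the budget lasts exactly one window. At roughly 3x, as in the example, a 30-day budget is gone in about ten days, which is why the warning threshold sits well below the paging threshold.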
When the budget is exhausted, the configured exhaustion_action determines the system response:

```python
from agent_sre import SLO, ErrorBudget
from agent_sre.slo.indicators import PolicyCompliance
from agent_sre.slo.objectives import ExhaustionAction

# Configure what happens when error budget is exhausted
safety_slo = SLO(
    name="safety-compliance",
    indicators=[PolicyCompliance(target=0.99, window="7d")],
    error_budget=ErrorBudget(
        total=0.01,
        window_seconds=2592000,
        burn_rate_alert=2.0,     # fires at 2x sustainable burn rate
        burn_rate_critical=5.0,  # fires at 5x sustainable burn rate
        exhaustion_action=ExhaustionAction.CIRCUIT_BREAK,  # suspend agent when budget is gone
    ),
)

# In your monitoring loop, check for firing alerts
alerts = safety_slo.error_budget.firing_alerts()
for alert in alerts:
    print(f"Alert firing: {alert.name} (severity: {alert.severity})")

# Check budget status
print(f"Budget remaining: {safety_slo.error_budget.remaining_percent:.1f}%")
print(f"Current burn rate: {safety_slo.error_budget.burn_rate():.2f}x")
print(f"Exhausted: {safety_slo.error_budget.is_exhausted}")
```

This is the governance dial from the other direction. The error budget is not just a metric: it is the mechanism that drives agent autonomy decisions. An agent with a clean 30-day safety record earns autonomy. An agent whose budget is burning at 5x the sustainable rate triggers a critical alert, and when the budget is exhausted, the exhaustion_action fires: ALERT, THROTTLE, FREEZE_DEPLOYMENTS, or CIRCUIT_BREAK. The graduated response mirrors what SRE teams already do with service SLOs, applied to agent behavior.

There are multiple SLI dimensions built into Agent SRE.
Safety SLIs and Performance SLIs track different aspects of the same agent:

| SLI Type | What It Measures | Target Pattern | When Budget Burns |
|---|---|---|---|
| Safety SLI | PolicyCompliance: fraction of actions within authorized scope | >= 99% | Restrict capabilities, increase human oversight |
| Performance SLI | TaskSuccessRate, ResponseLatency, CostPerTask | Configurable per workload | Alert, throttle, or circuit-break LLM provider |

Additional built-in indicators include ToolCallAccuracy, DelegationChainDepth, HallucinationRate, and CalibrationDeltaSLI.

Both SLOs feed into the same error budget dashboard. An agent can have excellent performance but a degrading safety record, or perfect safety compliance and terrible cost efficiency. You need both dimensions to understand whether an agent is production-ready.

Circuit Breakers: Governing Agent Failure Modes That Don't Exist in Microservices

Circuit breakers for services protect against one failure mode: a backend that is slow or unreachable. The pattern is CLOSED -> OPEN -> HALF_OPEN. You know it well.
Agent SRE implements the same state machine for failure modes that are specific to autonomous reasoning systems and do not exist in traditional microservice architectures:

```python
from agent_sre.cascade.circuit_breaker import CircuitBreakerConfig, CircuitBreaker
from agent_sre.chaos.engine import FaultType

config = CircuitBreakerConfig(
    failure_threshold=5,          # Open after 5 failures in the window
    recovery_timeout_seconds=60,  # Stay OPEN for 60s before HALF_OPEN
    half_open_max_calls=3,        # Allow 3 probes in HALF_OPEN
)

breaker = CircuitBreaker(agent_id="analyst-agent-001", config=config)

# Failure modes tracked by the circuit breaker:
tracked_faults = [
    FaultType.POLICY_BYPASS,       # Agent exceeds authorized scope
    FaultType.ERROR_INJECTION,     # Upstream model API fails
    FaultType.TIMEOUT_INJECTION,   # Tool calls exceed time budget
    FaultType.TRUST_PERTURBATION,  # Agent trust score falls below threshold
    FaultType.DEADLOCK_INJECTION,  # Agent stuck in iterative reasoning
]
```

Each failure mode has different circuit-breaking semantics:

| Failure Mode | What Triggers It | Circuit-Break Behavior |
|---|---|---|
| Policy bypass | Action denied by policy engine | Count toward threshold; log with full context |
| LLM provider error | HTTP 5xx from model API | Immediately open; route to fallback model if configured |
| Tool timeout | Tool call exceeds timeout_ms | Count toward threshold; cancel in-flight call |
| Trust score degradation | Agent trust score drops below configured floor | Open; escalate to Ring 3 (untrusted) until score recovers |
| Reasoning loop / deadlock | Token or iteration count exceeds budget | Open; trigger human review before resuming |

The reasoning loop breaker deserves attention. A microservice cannot get stuck reasoning. An AI agent absolutely can, and when it does, the failure is not an error code: it is an agent that keeps calling tools, consuming tokens, and generating audit events indefinitely.
The circuit breaker detects this pattern from the iteration count and token budget and terminates the loop:

```python
# Reasoning loop detection configuration
loop_detection_config = {
    "max_iterations": 15,             # Hard stop after 15 reasoning steps
    "max_tokens_per_session": 50000,  # Hard stop on token consumption
    "repetition_threshold": 0.85,     # Stop if >85% of recent actions repeat prior ones
    "on_detection": "circuit_break_and_escalate",
}
```

The state machine behaves identically to what you know from Hystrix or Resilience4j. What changes is the definition of "failure."

```
CLOSED (serving)
    |
    | failure_threshold crossed for any tracked fault
    v
OPEN (rejecting -- agent action denied, fallback or human-in-loop fires)
    |
    | recovery_timeout expires
    v
HALF_OPEN (probe -- limited requests allowed through)
    |
    |-- success_threshold met --> CLOSED
    |-- any failure --> OPEN (reset timeout)
```

Chaos Engineering for Agents: Fault Injection for Autonomous Systems

The only way to know if your agent system is resilient is to break it intentionally. Traditional chaos engineering targets infrastructure: kill a pod, inject network latency, saturate a disk. Agent chaos engineering targets the failure modes specific to autonomous reasoning systems.
Agent SRE ships fault injection templates that cover the failure modes teams consistently underestimate until they hit production:

```python
from agent_sre.chaos.engine import ChaosExperiment, Fault, FaultType

# Experiment 1: LLM provider degrades -- model returns valid responses but with
# increased latency and occasional malformed outputs
experiment = ChaosExperiment(
    name="llm-degradation-resilience",
    target_agent="analyst-agent-001",
    description="Test agent behavior under degraded LLM provider",
    faults=[
        Fault.latency_injection(target="llm-provider", delay_ms=8000),
        Fault.error_injection(target="llm-provider", rate=0.05),
    ],
    duration_seconds=300,
)

# Experiment 2: Trust score manipulation -- simulates an agent receiving
# messages from a peer with a spoofed trust score
trust_experiment = ChaosExperiment(
    name="trust-manipulation-resilience",
    target_agent="orchestrator-001",
    faults=[
        Fault(
            fault_type=FaultType.TRUST_PERTURBATION,
            target="did:mesh:orchestrator-001",
            params={"spoofed_score": 950},
        ),
    ],
    duration_seconds=120,
)

# Experiment 3: Tool timeout cascade -- multiple tools time out simultaneously,
# testing whether the agent abandons gracefully or enters a reasoning loop
cascade_experiment = ChaosExperiment(
    name="tool-timeout-cascade",
    target_agent="analyst-agent-001",
    faults=[
        Fault.timeout_injection(target="database.read", delay_ms=30000),
        Fault.timeout_injection(target="api.call", delay_ms=30000),
    ],
    duration_seconds=180,
)

# Run the experiment
experiment.start()
# ... inject faults during agent execution ...
resilience = experiment.calculate_resilience(
    baseline_success_rate=0.95,
    experiment_success_rate=0.87,
    recovery_time_ms=48000,
)
experiment.complete(resilience=resilience)
print(f"Resilience score: {resilience.overall}/100 -- {'PASSED' if resilience.passed else 'FAILED'}")
```

Additional fault types built into the chaos engine cover: prompt injection attempts, privilege escalation, data exfiltration attempts, identity spoofing, deadlock injection, and contradictory instruction scenarios. Each maps to a FaultType enum value and can be composed into multi-fault experiments.

Important: The chaos engine records that a fault was injected and triggers the governance response pipeline. Actual infrastructure-level fault injection (network partition, process kill) should be implemented using your existing chaos tooling (Chaos Mesh, Gremlin, Azure Chaos Studio, or similar). Agent SRE governs the agent's behavioral response to faults; it does not own infrastructure manipulation. These two layers are designed to compose.

Each chaos experiment produces a structured resilience score via calculate_resilience(), which compares baseline and experiment success rates. A score of 90+ with passed=True means the agent maintained at least 90% of its baseline performance under fault conditions. Teams use this to set minimum resilience thresholds for production readiness.

Replay Debugging: Reproduce Behavioral Failures Exactly

Infrastructure incidents are reproducible because infrastructure is deterministic. AI agent incidents are hard to reproduce because agent behavior depends on model state, context window content, and the sequence of tool call results, none of which are preserved by default after a session ends.

Agent SRE's replay engine records every agent session as a replayable artifact: the full trace at each step, every tool call with its inputs and outputs, every policy evaluation with its decision, and every trust score at the time of each inter-agent message.
```python
import asyncio

from agent_sre.replay.capture import TraceStore
from agent_sre.replay.engine import ReplayEngine, ReplayMode

# Traces are captured automatically when SRE tracing is active
store = TraceStore(
    backend="azure_blob",
    retention_days=30,
)

# When an incident occurs, replay the session exactly
engine = ReplayEngine(store=store)


async def main():
    # Full replay: re-run the session against the same recorded inputs
    # Uses recorded tool outputs -- no live tool calls -- so replay is deterministic
    result = await engine.replay(
        trace_id="trace_2026_05_a7f3b2",
        mode=ReplayMode.FULL,
    )
    for step in result.steps:
        print(f"Step {step.index}: {step.action} -> {step.decision}")

    # Divergence analysis: replay with a policy change applied
    # Shows exactly which actions would have been blocked under the new policy
    diff_result = await engine.diff(
        trace_id="trace_2026_05_a7f3b2",
        policy_override="policies/stricter-v2.yaml",
    )
    for diff in diff_result.diffs:
        if diff.description:
            print(f"Step {diff.span_name}: was {diff.original}, "
                  f"would be {diff.replayed} under new policy")


# The replay calls are coroutines, so they need a running event loop
asyncio.run(main())
```

The divergence analysis is the feature teams use most. When a policy change is proposed, you replay recent production traces against the new policy to see how many actions would have been blocked, which sessions would have failed, and what the error budget impact would have been. Policy changes stop being guesswork.

Progressive Delivery: Safely Rolling Out New Agent Capabilities

When you ship a new service version, you do not send it to all traffic at once. You use canary deployments, feature flags, or traffic splitting. You watch the SLOs. If they degrade, you roll back.

Agent SRE brings the same discipline to agent capability rollout. When you expand an agent's authorized scope, giving it write access it did not have, connecting it to a new tool, or raising its trust floor, you do not expand to the full fleet immediately. You expand progressively, with automated SLO gates controlling each stage.
```python
from agent_sre.delivery.rollout import (
    AnalysisCriterion,
    CanaryRollout,
    RollbackCondition,
    RolloutStep,
)

rollout = CanaryRollout(
    name="database-write-capability",
    steps=[
        RolloutStep(
            name="canary",
            weight=0.05,             # 5% of agents get the new capability
            duration_seconds=86400,  # 24 hours
            analysis=[
                AnalysisCriterion(metric="safety_sli", threshold=0.995),
                AnalysisCriterion(metric="performance_sli", threshold=0.90),
                AnalysisCriterion(
                    metric="error_budget_consumed",
                    threshold=0.10,
                    comparator="lte",  # canary can burn at most 10%
                ),
            ],
        ),
        RolloutStep(
            name="early-adopters",
            weight=0.25,              # 25% traffic
            duration_seconds=172800,  # 48 hours
            analysis=[
                AnalysisCriterion(metric="safety_sli", threshold=0.990),
                AnalysisCriterion(metric="performance_sli", threshold=0.88),
            ],
        ),
        RolloutStep(
            name="general-availability",
            weight=1.0,               # 100% traffic
            duration_seconds=604800,  # 1 week of full observation
            analysis=[
                AnalysisCriterion(metric="safety_sli", threshold=0.990),
                AnalysisCriterion(metric="performance_sli", threshold=0.85),
            ],
        ),
    ],
    rollback_conditions=[
        RollbackCondition(metric="safety_sli", threshold=0.95, comparator="lte"),
    ],
)

# Start the rollout -- SLO gates evaluate at each step
rollout.start()

# Advance to next step when analysis criteria pass
if rollout.advance():
    print(f"Advanced to step: {rollout.current_step.name}")
    print(f"Progress: {rollout.progress_percent:.0f}%")
```

The SLO gate at each step is the same mechanism as a CI/CD quality gate, but measured on live production behavior rather than test results. An agent capability that degrades the safety SLI during canary does not promote to the next step. If a RollbackCondition fires, the rollout rolls back automatically.

This is the mechanism that makes it operationally safe to expand agent autonomy: every expansion is measurable, every measurement gates the next expansion, and rollback is automatic.

Health Checks and Backpressure

Traditional health checks answer: is the service alive? For agents, alive is not enough.
A healthy agent is one that is alive, operating within policy, consuming resources within budget, and maintaining a trust score above the Ring threshold it was assigned.

    # Agent health check covering multiple dimensions
    health = await agent_health_check(
        agent_id="analyst-agent-001",
        dimensions=[
            "liveness",           # Is the agent process running?
            "policy_compliance",  # Is safety SLI above threshold?
            "trust_score",        # Is trust score above Ring floor?
            "resource_budget",    # Is token/API spend within limits?
            "tool_availability",  # Are the tools the agent needs reachable?
        ],
    )
    # health.status: "healthy" | "degraded" | "unhealthy"
    # health.dimensions: per-dimension pass/fail with values
    # health.recommended_action: "none" | "restrict" | "suspend" | "terminate"

When health checks report degradation, backpressure controls engage before the circuit breaker opens. Backpressure is the earlier, softer response: accept fewer concurrent tasks, reject low-priority work, drain in-flight tasks gracefully before the situation escalates.

    # Backpressure configuration
    backpressure_config = {
        "backpressure_threshold": 0.80,  # Engage when resource utilization > 80%
        "max_concurrent": 5,             # Hard cap on simultaneous agent tasks
        "priority_shedding": True,       # Drop low-priority tasks first
        "drain_timeout_seconds": 30,     # Allow in-flight tasks to complete
    }

The ordering matters: backpressure first, then circuit breaker, then suspension. Each stage is recoverable. Each stage preserves more agent state than the next. The SRE principle of graduated response applies to agents exactly as it applies to services.

Observability: Governance Metrics Flow Into Your Existing Stack

Agent SRE does not ask you to adopt a new observability platform. Governance metrics are exported through the same adapters your infrastructure monitoring already uses, including OpenTelemetry, Prometheus, Datadog, and others.
    from agent_sre.tracing.exporters import configure_exporters

    configure_exporters(
        backends=[
            {"type": "prometheus", "endpoint": "http://prometheus:9090"},
            {"type": "opentelemetry", "endpoint": "http://otel-collector:4317"},
        ],
        include_metrics=[
            "slo.safety_sli",              # Per-agent safety compliance rate
            "slo.error_budget_remaining",  # Error budget in percentage
            "slo.burn_rate",               # Current burn rate vs sustainable
            "circuit_breaker.state",       # CLOSED / OPEN / HALF_OPEN
            "circuit_breaker.failure_count",
            "trust_score.current",         # Agent trust score (0-1000)
            "trust_score.ring",            # Current execution ring
            "chaos.experiments_run",       # Chaos experiment telemetry
            "health.status",               # Aggregate health status
            "backpressure.load",           # Current load vs threshold
        ],
    )

Key governance metrics available in your existing dashboards:

| Metric | What It Tells You | Alert Condition |
|---|---|---|
| slo.safety_sli | Fraction of agent actions within policy | < 0.99 |
| slo.burn_rate | Rate at which error budget is consumed | > 2.0 (warn), > 5.0 (page) |
| slo.error_budget_remaining | Budget left for the SLO window | < 20% |
| circuit_breaker.state | Current breaker state per agent | OPEN or HALF_OPEN |
| trust_score.ring | Execution ring (privilege level) | Ring 3 (untrusted) |
| health.status | Aggregate health across all dimensions | degraded or unhealthy |

If you are already running Grafana dashboards for your services, a governance dashboard for your agent fleet is a new data source and a new set of panels, not a new monitoring stack.
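The burn-rate alert conditions listed above (warn above 2.0, page above 5.0) are easier to reason about with the arithmetic written out. The following is a minimal sketch in plain Python, not code from the `agent-sre` package: the function names `burn_rate` and `classify_burn_rate` are hypothetical, and it assumes the common SRE definition of burn rate as the observed bad-event fraction divided by the fraction the SLO allows.

```python
def burn_rate(good_events: int, total_events: int, slo_target: float) -> float:
    """Return how fast the error budget burns relative to the sustainable rate.

    1.0 means the budget would be used up exactly at the end of the SLO
    window; values above 1.0 mean it runs out early.
    (Assumed definition: observed error rate / allowed error rate.)
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events <= 0:
        return 0.0  # no events recorded, so nothing has been burned
    observed_error_rate = 1.0 - good_events / total_events
    allowed_error_rate = 1.0 - slo_target
    return observed_error_rate / allowed_error_rate


def classify_burn_rate(rate: float, warn: float = 2.0, page: float = 5.0) -> str:
    """Map a burn rate onto the warn/page thresholds from the metrics table."""
    if rate > page:
        return "page"
    if rate > warn:
        return "warn"
    return "ok"


if __name__ == "__main__":
    # 30 policy violations in 1000 actions against a 99% safety SLO:
    # the budget is burning roughly three times faster than sustainable.
    rate = burn_rate(good_events=970, total_events=1000, slo_target=0.99)
    print(f"burn rate: {rate:.1f} -> {classify_burn_rate(rate)}")
```

In a real deployment the same comparison would more likely live in an alerting rule in your monitoring backend than in application code; the sketch only shows why a burn rate of 2.0 or 5.0 is a meaningful threshold.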
The SRE Mental Model for Agents: Four New Concepts

Everything in Agent SRE is built on the SRE mental model you already have, extended with four concepts that adapt traditional reliability thinking for autonomous systems:

| Traditional SRE | Agent SRE Equivalent | What Changes |
|---|---|---|
| Latency SLI | Safety SLI | Correctness of *action*, not speed of *response* |
| Error budget | Autonomy budget | Burns on policy violations, not just errors |
| Circuit breaker | Behavioral circuit breaker | Opens on wrong *behavior*, not just failure codes |
| Canary deployment | Capability rollout | Rolls out *scope*, not just code |

The governance insight is that error budgets work in both directions for agents. A service's error budget only decreases. An agent's autonomy is also a budget: it grows when the safety SLI is strong and shrinks when it degrades. The error budget mechanism becomes the operational mechanism for expanding and contracting agent autonomy in response to evidence, which is exactly what regulated industries and risk-averse enterprise teams need before they will trust an autonomous agent with consequential actions.

Getting Started with Agent SRE

    pip install agent-sre

A minimal Agent SRE integration requires three things: a safety SLO definition, a circuit breaker, and a health check. The progressive delivery and chaos engineering features layer on top when you are ready for them.
    from agent_sre import SLO, ErrorBudget
    from agent_sre.slo.indicators import TaskSuccessRate
    from agent_sre.cascade.circuit_breaker import CircuitBreakerConfig, CircuitBreaker

    # Step 1: Define your safety SLO
    slo = SLO(
        name="production-safety",
        indicators=[TaskSuccessRate(target=0.99, window="24h")],
        error_budget=ErrorBudget(total=0.01, burn_rate_alert=2.0, burn_rate_critical=5.0),
    )

    # Step 2: Configure a circuit breaker
    breaker_config = CircuitBreakerConfig(
        failure_threshold=5,
        recovery_timeout_seconds=60,
        half_open_max_calls=3,
    )
    breaker = CircuitBreaker(agent_id="my-agent", config=breaker_config)

    # Step 3: Wire into your existing agent loop
    async def governed_agent_loop(agent, task):
        # Check health first
        if not await agent_is_healthy(agent.id):
            return {"error": "agent suspended", "reason": "health check failed"}

        # Run within circuit breaker protection
        async with breaker:
            result = await agent.run(task)
            slo.record_event(good=result.policy_compliant)
            return result

The quickstart in the repository walks through a complete setup with safety SLOs, circuit breakers, and a Prometheus dashboard export in under 50 lines.

Why This Matters

Most AI observability tools today focus on what you might call model quality: hallucination rate, latency, token cost, task completion. These are useful metrics. They are not SRE metrics. They do not answer whether the agent acted within its authorized scope, whether its behavioral error budget is burning at a dangerous rate, or whether it would survive the LLM provider going down.

Agent SRE answers those questions using the operational vocabulary that SRE teams already understand: SLOs, error budgets, circuit breakers, chaos experiments, and health checks. The goal is not to replace your observability stack. It is to make agent governance visible inside it.

The reliability of an autonomous agent is not a property of the model. It is a property of the governance infrastructure around it. Agent SRE is that infrastructure.
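To make the circuit breaker settings used in the getting-started snippet more concrete (`failure_threshold=5`, `recovery_timeout_seconds=60`, `half_open_max_calls=3`), here is a minimal sketch of a three-state breaker in plain Python. It is an illustration only and makes no claim about how `agent_sre.cascade.circuit_breaker` is implemented: the class `SketchCircuitBreaker`, its methods, and the rule that three good trial calls close the breaker are all assumptions chosen for the example.

```python
import time
from typing import Callable


class SketchCircuitBreaker:
    """Illustrative CLOSED / OPEN / HALF_OPEN breaker (hypothetical, not agent-sre)."""

    def __init__(
        self,
        failure_threshold: int = 5,
        recovery_timeout_seconds: float = 60.0,
        half_open_max_calls: int = 3,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.recovery_timeout_seconds = recovery_timeout_seconds
        self.half_open_max_calls = half_open_max_calls
        self._clock = clock  # injectable so the sketch can be tested without sleeping
        self.state = "CLOSED"
        self._failures = 0          # consecutive bad outcomes while CLOSED
        self._opened_at = 0.0       # clock value when the breaker last opened
        self._trial_calls = 0       # calls admitted while HALF_OPEN
        self._trial_successes = 0   # good outcomes seen while HALF_OPEN

    def allow_call(self) -> bool:
        """Return True if the agent may act in the current state."""
        if self.state == "OPEN":
            if self._clock() - self._opened_at < self.recovery_timeout_seconds:
                return False
            # Recovery timeout elapsed: admit a limited number of trial calls.
            self.state = "HALF_OPEN"
            self._trial_calls = 0
            self._trial_successes = 0
        if self.state == "HALF_OPEN":
            if self._trial_calls >= self.half_open_max_calls:
                return False
            self._trial_calls += 1
        return True

    def record(self, good: bool) -> None:
        """Record an outcome; good=False can mean a policy violation, not only an error."""
        if self.state == "OPEN":
            return  # late results do not change an already open breaker
        if self.state == "HALF_OPEN":
            if not good:
                self._trip()  # any bad trial call reopens the breaker
                return
            self._trial_successes += 1
            if self._trial_successes >= self.half_open_max_calls:
                self.state = "CLOSED"
                self._failures = 0
            return
        if good:
            self._failures = 0
            return
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._trip()

    def _trip(self) -> None:
        self.state = "OPEN"
        self._opened_at = self._clock()


if __name__ == "__main__":
    now = [0.0]  # fake clock so the example runs instantly
    breaker = SketchCircuitBreaker(clock=lambda: now[0])
    for _ in range(5):
        breaker.record(good=False)
    print(breaker.state, breaker.allow_call())
    now[0] = 61.0
    print(breaker.allow_call(), breaker.state)
```

The point of the sketch is the "behavioral" part: the caller decides what counts as a bad outcome, so feeding it `result.policy_compliant` rather than an exception flag is what turns an ordinary breaker into one that opens on wrong behavior.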
Resources

- GitHub: github.com/microsoft/agent-governance-toolkit
- Install: pip install agent-sre
- Tutorials: 40+ tutorials including dedicated Agent SRE walkthroughs for SLO setup, chaos experiments, and progressive delivery
- Architecture reference: ARCHITECTURE.md
- OWASP compliance mapping: OWASP-COMPLIANCE.md -- Agent SRE addresses ASI-08 (Cascading Failures) directly through circuit breakers and SLO-based fault detection
- Part 1 -- Runtime governance: Policy engines, trust, and SRE overview
- Part 2 -- Shift-left governance: Catching violations before production
- Part 3 -- Post-hoc accountability: After the agent acts

The Agent Governance Toolkit is an open-source project released under the MIT License. All features described in this post are available in the public repository. The `agent-sre` package is currently in public preview; APIs may change before general availability.

Questions about Agent SRE in your environment? Open an issue at aka.ms/agent-governance-toolkit or start a discussion in the comments below.

Project Pavilion Presence at KubeCon NA 2025
KubeCon + CloudNativeCon NA took place in Atlanta, Georgia, from 10-13 November, and continued to highlight the ongoing growth of the open source, cloud-native community. Microsoft participated throughout the event and supported several open source projects in the Project Pavilion. Microsoft’s involvement reflected our commitment to upstream collaboration, open governance, and enabling developers to build secure, scalable, and portable applications across the ecosystem.

The Project Pavilion serves as a dedicated, vendor-neutral space on the KubeCon show floor reserved for CNCF projects. Unlike the corporate booths, it focuses entirely on open source collaboration. It brings maintainers and contributors together with end users for hands-on demos, technical discussions, and roadmap insights. This space helps attendees discover emerging technologies and understand how different projects fit into the cloud-native ecosystem. It plays a critical role in exchanging ideas, resolving challenges, and strengthening collaboration across CNCF-approved technologies.

Why Our Presence Matters

KubeCon NA remains one of the most influential gatherings for developers and organizations shaping the future of cloud-native computing. For Microsoft, participating in the Project Pavilion helps advance our goals of:

- Open governance and community-driven innovation
- Scaling vital cloud-native technologies
- Secure and sustainable operations
- Learning from practitioners and adopters
- Enabling developers across clouds and platforms

Many of Microsoft’s products and cloud services are built on or aligned with CNCF and open-source technologies. Being active within these communities ensures that we are contributing back to the ecosystem we depend on and designing by collaborating with the community, not just for it.
Microsoft-Supported Pavilion Projects

containerd
Representative: Wei Fu

The containerd team engaged with project maintainers and ecosystem partners to explore solutions for improving AI model workflows. A key focus was the challenge of handling large OCI artifacts (often 500+ GiB) used in AI training workloads. Current image-pulling flows require containerd to fetch and fully unpack blobs, which significantly delays pod startup for large models. Collaborators from Docker, NTT, and ModelPack discussed a non-unpacking workflow that would allow training workloads to consume model data directly. The team plans to prototype this behavior as an experimental feature in containerd. Additional discussions included updates related to nerdbox and next steps for the erofs snapshotter.

Copacetic
Representative: Joshua Duffney

The Copa booth attracted roughly 75 attendees, with strong representation from federal agencies and financial institutions, a sign of growing adoption in regulated industries. A lightning talk delivered at the conference significantly boosted traffic and engagement. Key feedback and insights included:

- High interest in customizable package update sources
- Demand for application-level patching beyond OS-level updates
- Need for clearer CI/CD integration patterns
- Expectations around in-cluster image patching
- Questions about runtime support, including Podman

The conversations revealed several documentation gaps and feature opportunities that will inform Copa’s roadmap and future enablement efforts.

Drasi
Representative: Nandita Valsan

KubeCon NA 2025 marked Drasi’s first in-person presence since its launch in October 2024 and its entry into the CNCF Sandbox in early 2025. With multiple kiosk slots, the team interacted with ~70 visitors across shifts.
Engagement highlights included:

- New community members joining the Drasi Discord and starring GitHub repositories
- Meaningful discussions with observability and incident management vendors interested in change-driven architectures
- Positive reception to Aman Singh’s conference talk, which led attendees back to the booth for deeper technical conversations

Post-event follow-ups are underway with several sponsors and partners to explore collaboration opportunities.

Flatcar Container Linux
Representatives: Sudhanva Huruli and Vamsi Kavuru

The Flatcar project had some fantastic conversations at the pavilion. Attendees were eager to learn about bare metal provisioning, GPU support for AI workloads, and how Flatcar’s fully automated build and test process keeps things simple and developer friendly. Questions around Talos vs. Flatcar and CoreOS sparked lively discussions, with the team emphasizing Flatcar’s usability and independence from an OS-level API. Interest came from government agencies and financial institutions, and the preview of Flatcar on AKS opened the door to deeper conversations about real-world adoption. The Project Pavilion proved to be the perfect venue for authentic, technical exchanges.

Flux
Representative: Dipti Pai

The Flux booth was active throughout all three days of the Project Pavilion, where Microsoft joined other maintainers to highlight new capabilities in Flux 2.7, including improved multi-tenancy, enhanced observability, and streamlined cloud-native integrations. Visitors shared real-world GitOps experiences, both successes and challenges, which provided valuable insights for the project’s ongoing development. Microsoft’s involvement reinforced strong collaboration within the Flux community and continued commitment to advancing GitOps practices.

Headlamp
Representatives: Joaquim Rocha, Will Case, and Oleksandr Dubenko

Headlamp had a booth for all three days of the conference, engaging with both longstanding users and first-time attendees.
The increased visibility from becoming a Kubernetes sub-project was evident, with many attendees sharing their usage patterns across large tech organizations and smaller industrial teams. The booth enabled maintainers to:

- Gather insights into how teams use Headlamp in different environments
- Introduce Headlamp to new users discovering it via talks or hallway conversations
- Build stronger connections with the community and understand evolving needs

Inspektor Gadget
Representatives: Jose Blanquicet and Mauricio Vásquez Bernal

Hosting a half-day kiosk session, Inspektor Gadget welcomed approximately 25 visitors. Attendees included newcomers interested in learning the basics and existing users looking for updates. The team showcased new capabilities, including the tcpdump gadget and Prometheus metrics export, and invited visitors to the upcoming contribfest to encourage participation.

Istio
Representatives: Keith Mattix, Jackie Maertens, Steven Jin Xuan, Niranjan Shankar, and Mike Morris

The Istio booth continued to attract a mix of experienced adopters and newcomers seeking guidance. Technical discussions focused on:

- Enhancements to multicluster support in ambient mode
- Migration paths from sidecars to ambient
- Improvements in Gateway API availability and usage
- Performance and operational benefits for large-scale deployments

Users, including several Azure customers, expressed appreciation for Microsoft’s sustained investment in Istio as part of their service mesh infrastructure.

Notary Project
Representatives: Feynman Zhou and Toddy Mladenov

The Notary Project booth saw significant interest from practitioners concerned with software supply chain security. Attendees discussed signing, verification workflows, and integrations with Azure services and Kubernetes clusters. The conversations will influence upcoming improvements across Notary Project and Ratify, reinforcing Microsoft’s commitment to secure artifacts and verifiable software distribution.
Open Policy Agent (OPA) - Gatekeeper
Representative: Jaydip Gabani

The OPA/Gatekeeper booth enabled maintainers to connect with both new and existing users to explore use cases around policy enforcement, Rego/CEL authoring, and managing large policy sets. Many conversations surfaced opportunities around simplifying best practices and reducing management complexity. The team also promoted participation in an ongoing Gatekeeper/OPA survey to guide future improvements.

ORAS
Representatives: Feynman Zhou and Toddy Mladenov

ORAS engaged developers interested in OCI artifacts beyond container images, including AI/ML models, metadata, backups, and multi-cloud artifact workflows. Attendees appreciated ORAS’s ecosystem integrations and found the booth examples useful for understanding how artifacts are tagged, packaged, and distributed. Many users shared how they leverage ORAS with Azure Container Registry and other OCI-compatible registries.

Radius
Representative: Zach Casper

The Radius booth attracted the attention of platform engineers looking for ways to simplify their developers' experience while still enforcing enterprise-grade infrastructure and security best practices. Attendees saw demos on deploying a database to Kubernetes and using managed databases from AWS and Azure without modifying the application deployment logic. They also saw a preview of Radius integration with GitHub Copilot, enabling AI coding agents to autonomously deploy and test applications in the cloud.

Conclusion

KubeCon + CloudNativeCon North America 2025 reinforced the essential role of open source communities in driving innovation across cloud native technologies. Through the Project Pavilion, Microsoft teams were able to exchange knowledge with other maintainers, gather user feedback, and support projects that form foundational components of modern cloud infrastructure.
Microsoft remains committed to building alongside the community and strengthening the ecosystem that powers so much of today’s cloud-native development. For anyone interested in exploring or contributing to these open source efforts, please reach out directly to each project’s community to get involved, or contact Lexi Nadolski at lexinadolski@microsoft.com for more information.

Innovations and Strengthening Platforms Reliability Through Open Source
The Linux Systems Group (LSG) at Microsoft is the team building OS innovations in Azure, enabling secure and high-performance platforms that power millions of workloads worldwide. From providing the OS for Boost, to optimizing Linux kernels for hyperscale environments, to contributing to open-source projects like Rust-VMM and Cloud Hypervisor, LSG ensures customers get the best of Linux on Azure. Our work spans performance tuning, security hardening, and feature enablement for new silicon and cutting-edge technologies such as Confidential Computing, ARM64, and NVIDIA Grace Blackwell, all while strengthening the global open-source ecosystem.

Our philosophy is simple: we develop in the open and upstream first, integrating improvements into our products after they’ve been accepted by the community. At Ignite, we would like to highlight a few key open-source contributions from 2025 that form the foundation for many of the product offerings and innovations you will see throughout the week. We helped bring seamless kernel update features (Kexec HandOver) to the Linux kernel, improved networking paths for AI platforms, strengthened container orchestration and security efforts, and shared engineering insights with global communities and conferences.

This work reflects Microsoft’s long-standing commitment to open source, grounded in active upstream participation and close collaboration with partners across the ecosystem. Our engineers work side-by-side with maintainers, Linux distro partners, and silicon providers to ensure contributions land where they help the most, from kernel updates to improvements that support new silicon platforms.

Linux Kernel Contributions

Enabling Seamless Kernel Updates: Persistent uptime for critical services is a top priority. This year, Microsoft engineer Mike Rapoport successfully merged Kexec HandOver (KHO) into Linux 6.16.
KHO is a kernel mechanism that preserves memory state across a reboot (kexec), allowing systems to carry over important data when loading a new kernel. In practice, this means Microsoft can apply security patches or kernel updates to the Azure platform and customer VMs without rebooting or with significantly reduced downtime. It’s a technical achievement with real impact: cloud providers and enterprises can update Linux on the fly, enhancing security and reliability for services that demand continuous availability.

Optimizing Network Drivers for AI Scale: Massive AI models require massive bandwidth. Working closely with our partners deploying large AI workloads on Azure, LSG engineers delivered a breakthrough in Linux networking performance. The LSG team rearchitected the receive path of the MANA network driver (used by our smart NICs) to eliminate wasted memory and enable recycling of buffers. The results:

- 2x higher effective network throughput on 64 KB page systems
- 35% better memory efficiency for RX buffers
- 15% higher throughput and roughly half the memory use even on standard x86_64 VMs

References
- MANA RX optimization patch: net: mana: Use page pool fragments for RX buffers (LKML)
- Linux Plumbers 2025 talk: Optimizing traffic receive (RX) path in Linux kernel MANA Driver for larger PAGE_SIZE systems

Improving Reliability for Cloud Networking: In addition to raw performance, reliability got a boost. One critical fix addressed a race condition in the Hyper-V hv_netvsc driver that sometimes caused packet loss when a VM’s network channel initialized. By patching this upstream, we improved network stability for all Linux guests running on Hyper-V, keeping customer VMs running smoothly during dynamic operations like scale-out or live migrations. Our engineers also upstreamed numerous improvements to Hyper-V device drivers (covering storage, memory, and general virtualization). We fixed interrupt handling bugs, eliminated outdated patches, and resolved issues affecting ARM64 architectures.
Each of these fixes was contributed to the mainline kernel, ensuring that any Linux distribution running on Hyper-V or Azure benefits from the enhanced stability and performance.

References
- Upstream fix: hv_netvsc race on early receive events: kernel.org commit referenced by Ubuntu bug (Launchpad)
- Ubuntu Azure backport write-up: Bug 2127705 – hv_netvsc: fix loss of early receive events from host during channel open (Launchpad)
- Older background on hv_netvsc packet-loss issues: kernel.org bug 81061

Strengthening Core Linux Infrastructure: Several of our contributions targeted fundamental kernel subsystems that all Linux users rely on. For example, we led significant enhancements to the Virtual File System (VFS) layer, reworking how Linux handles process core dumps and expanding file management capabilities. These changes improve how Linux handles files and memory under the hood, benefiting scenarios from large-scale cloud storage to local development. We also continued upstream efforts to support advanced virtualization features. Our team is actively upstreaming the mshv_vtl driver (for managing secure partitions on Hyper-V) and improving Linux’s compatibility with nested virtualization on Azure’s Microsoft Hypervisor (MSHV). All this low-level work adds up to a more robust and feature-rich kernel for everyone.

References
- Example VFS coredump work: split file coredumping into coredump_file()
- mshv_vtl driver patchset: Drivers: hv: Introduce new driver – mshv_vtl (v10) and v12 patch series on patchew

Bolstering Linux Security in the Cloud: Security has been a major thread across our upstream contributions. One focus area is making container workloads easier to verify and control. Microsoft engineers proposed an approach for code integrity in containers built on containerd’s EROFS snapshotter, shared as an open RFC in the containerd project on GitHub.
The idea is to use read-only images plus integrity metadata so that container file systems can be measured and checked against policy before they run.

We also engaged deeply with industry partners on kernel vulnerability handling. Through the Cloud-LTS Linux CVE workgroup, cloud providers and vendors collaborate in the open on a shared analysis of Linux CVEs. The group maintains a public repository that records how each CVE affects various kernels and configurations, which helps reduce duplicated triage work and speeds up security responses.

On the platform side, our engineers contributed fixes to the OP-TEE secure OS used in trusted execution and secure-boot scenarios, making sure that the cryptographic primitives required by Azure’s Linux boot flows behave correctly across supported devices. These changes help ensure that Linux verified boot chains remain reliable on Azure hardware.

References
- containerd RFC: Code Integrity for OCI/containerd Containers using erofs-snapshotter (GitHub)
- Cloud-LTS public CVE analysis repo: cloud-lts/linux-cve-analysis
- Linux CVE workgroup session at Linux Plumbers 2025: Linux CVE workgroup
- OP-TEE project docs: OP-TEE documentation

Developer Tools & Experience

Smoother OS Management with Systemd: To help Linux work seamlessly at Azure scale, the core init system, systemd, saw important improvements from our team this year. LSG contributed and merged upstream support for disk quota controls in systemd services. With new directives (like StateDirectoryQuota and CacheDirectoryQuota), administrators can easily enforce storage limits for service data, which is especially useful in scenarios like IoT devices with eMMC storage on Azure’s custom SoCs. In addition, Sea-Team added an auto-reload feature to systemd-journald, allowing log configuration changes to apply at runtime without restarting the logging service.
These improvements, now part of upstream systemd, help Azure and other Linux environments perform updates, maintenance, and configuration changes with minimal disruption to running services and workloads.

References
- systemd quota directives: systemd.exec(5) – StateDirectoryQuota and related options
- systemd journald reload behavior: systemd-journald.service(8)

Empowering Linux Quality at Scale: Running Linux on Azure at global scale requires extensive, repeatable testing. Microsoft continues to invest in LISA (Linux Integration Services Automation), an open-source framework that validates Linux kernels and distributions on Azure and other Hyper-V–based environments. Over the past year we expanded LISA with:

- New stress tests for rapid reboot sequences to catch elusive timing bugs
- Better failure diagnostics to make complex issues easier to root-cause
- Extended coverage for ARM64 scenarios and technologies like InfiniBand networking
- Integration of Azure VM SKU metadata and policy checks so that image validation can automatically confirm conformance to Azure requirements

These changes help us qualify new kernels, distributions, and VM SKUs before they are shipped to customers. Because LISA is open source, partners and Linux vendors can run the same tests and share results, which raises quality across the ecosystem.

References
- LISA GitHub repo: microsoft/lisa
- LISA documentation: Welcome to Linux Integration Services Automation LISA Documentation

Community Engagement and Leadership

Sharing Knowledge Globally: Open-source contribution is not just about code; it’s about people and knowledge exchange. Our team members took active roles in community events worldwide, reflecting Microsoft’s growing leadership in the Linux community.
We were proud to be a Platinum Sponsor of the inaugural Open Source Summit India 2025 in Hyderabad, where LSG engineers served on the program committee and hosted technical sessions. At Linux Security Summit Europe 2025, Microsoft’s security experts shaped the agenda as program committee members, delivered talks (such as “The State of SELinux”), and even led panel discussions alongside colleagues from Intel, Arm, and others. And in Paris at Kernel Recipes 2025, our own SMEs shared kernel insights with fellow developers. By engaging in these events, Microsoft not only contributes code but also helps guide the conversation on the future of Linux. These relationships and public interactions build mutual trust and ensure that we remain closely aligned with community priorities.

References
- Event: Open Source Summit India 2025 – Linux Foundation
- Paul Moore’s talk archive: LSS-EU 2025
- Conference: Kernel Recipes 2025 and Kernel Recipes 2025 schedule

Closing Thoughts

Microsoft’s long-term commitment to open source remains strong, and the Linux Systems Group will continue contributing upstream, collaborating across the industry, and supporting the upstream communities that shape the technologies we rely on. Our work begins in upstream projects such as the Linux kernel, Kubernetes, and systemd, where improvements are shared openly before they reach Azure. The progress highlighted in this blog was made possible by the wider Linux community whose feedback, reviews, and shared ideas help refine every contribution.

As we move ahead, we welcome maintainers, developers, and enterprise teams to engage with our projects, offer input, and collaborate with us. We will continue contributing code, sharing knowledge, and supporting the open-source technologies that power modern computing, working with the community to strengthen the foundation and shape a future that benefits everyone.
References & Resources:
- Microsoft’s Open-Source Journey – Azure Blog
- https://techcommunity.microsoft.com/t5/linux-and-open-source-blog/linux-and-open-source-on-azure-quarterly-update-february-2025/ba-p/4382722
- Cloud Hypervisor Project
- Rust-VMM Community
- Microsoft LISA (Linux Integration Services Automation) Repository
- Cloud-LTS Linux CVE Analysis Project

Azure Linux 3.0 Achieves Level 1 CIS Benchmark Certification
We’re excited to announce that Azure Linux 3.0 has successfully passed the Level 1 Center for Internet Security (CIS) benchmarks, reinforcing our commitment to delivering a secure and compliant platform for customers running Linux workloads on Azure Kubernetes Service (AKS).

What is CIS?

The Center for Internet Security is a nonprofit entity whose mission is to identify, develop, validate, promote, and sustain best practice solutions for cyber defense. It draws on the expertise of cybersecurity and IT professionals from government, business, and academia from around the world. To develop standards and best practices, including CIS benchmarks, controls, and hardened images, they follow a consensus decision-making model.

CIS benchmarks are configuration baselines and best practices for securely configuring a system. CIS controls map to many established standards and regulatory frameworks, including the NIST Cybersecurity Framework (CSF) and NIST SP 800-53, the ISO 27000 series of standards, PCI DSS, HIPAA, and others.

Each benchmark undergoes two phases of consensus review. The first occurs during initial development when experts convene to discuss, create, and test working drafts until they reach consensus on the benchmark. During the second phase, after the benchmark has been published, the consensus team reviews the feedback from the internet community for incorporation into the benchmark.

CIS benchmarks provide two levels of security settings:

- Level 1 recommends essential basic security requirements that can be configured on any system and should cause little or no interruption of service or reduced functionality.
- Level 2 recommends security settings for environments requiring greater security that could result in some reduced functionality.

What does this mean for Azure Linux 3.0?
By meeting Level 1 requirements, Azure Linux 3.0 ensures that essential security controls are in place, helping organizations meet regulatory compliance and protect against common threats without sacrificing performance or agility.

For security and compliance-focused customers, this milestone means you can confidently deploy and scale your Linux-based applications on AKS, knowing that your foundation aligns with industry best practices. Azure Linux 3.0’s compliance with CIS Level 1 benchmarks supports your efforts to achieve and maintain rigorous security postures, whether you’re subject to regulatory frameworks or following internal policies.

How can customers try it out?

We remain dedicated to making security simple. All Azure Linux 3.0 nodes on an AKS cluster will meet the Level 1 CIS benchmarks, with no extra flags or parameters required.

Resources

Visit the CIS Benchmark documentation to read a detailed list of benchmarks: Center for Internet Security (CIS) Benchmarks - Microsoft Compliance | Microsoft Learn.

Automating the Linux Quality Assurance with LISA on Azure
Introduction
Building on the insights from our previous blog regarding how Microsoft ensures the quality of Linux images, this article elaborates on the open-source tools that are instrumental in securing the performance, reliability, and overall excellence of virtual machines on Azure. While numerous testing tools are available for validating Linux kernels, guest OS images, and user space packages across various cloud platforms, finding a comprehensive testing framework that addresses the entire platform stack remains a significant challenge. A robust framework is essential: one that integrates seamlessly with Azure's environment, provides coverage for major testing tools such as LTP and kselftest, and covers critical areas like networking, storage, and specialized workloads, including Confidential VMs, HPC, and GPU scenarios. This unified testing framework is invaluable for developers, Linux distribution providers, and customers who build custom kernels and images.
This is where LISA (Linux Integration Services Automation) comes into play. LISA is an open-source tool specifically designed to automate and enhance the testing and validation processes for Linux kernels and guest OS images on Azure. In this blog, we cover the history of LISA, its key advantages, the wide range of test cases it supports, and why it is an indispensable resource for the open-source community. Moreover, LISA is available under the MIT License, making it free to use, modify, and contribute to.
History of LISA
LISA was initially developed as an internal tool by Microsoft to streamline the testing process of Linux images and kernel validations on Azure. Recognizing the value it could bring to the broader community, Microsoft open-sourced LISA, inviting developers and organizations worldwide to leverage and enhance its capabilities.
This move aligned with Microsoft's growing commitment to open-source collaboration, fostering innovation and shared growth within the industry. LISA serves as a robust solution to validate and certify that Linux images meet the stringent requirements of modern cloud environments. By integrating LISA into the development and deployment pipeline, teams can:
Enhance Quality Assurance: Catch and resolve issues early in the development cycle.
Reduce Time to Market: Accelerate deployment by automating repetitive testing tasks.
Build Trust with Users: Deliver stable and secure applications, bolstering user confidence.
Collaborate and Innovate: Leverage community-driven improvements and share insights.
Benefits of Using LISA
Scalability: Designed to run at scale, from 1 test case to 10k test cases in one command.
Multiple platform orchestration: LISA has a modular design that supports running the same test cases on various platforms, including Microsoft Azure, Windows Hyper-V, bare metal, and other cloud-based platforms.
Customization: Users can customize test cases, workflow, and other components to fit specific needs, allowing for targeted testing strategies. Examples include building kernels on the fly and sending results to a custom database.
Community Collaboration: Being open source under the MIT License, LISA encourages community contributions, fostering continuous improvement and shared expertise.
Extensive Test Coverage: It offers a rich suite of test cases covering many aspects of compatibility between Azure and Linux VMs, from kernel, storage, and networking to middleware.
How it works
Infrastructure
LISA is designed to be componentized and to maximize compatibility with different distros. Test cases can focus only on test logic. Once test requirements (machines, CPU, memory, etc.) are defined, you just write the test logic without worrying about environment setup or stopping services on different distributions.
Orchestration
LISA uses platform APIs to create, modify, and delete VMs. For example, LISA uses the Azure API to create VMs, run test cases, and delete VMs. While test cases run, LISA uses the Azure API to collect serial logs and can hot-add or hot-remove data disks. If other platforms implement the same serial log and data disk APIs, the test cases can run on those platforms seamlessly.
LISA ensures distro compatibility by abstracting over 100 commands used in test cases, so test authors can focus on validation logic rather than distro differences.
The pre-processing workflow assists in building the kernel on the fly, installing the kernel from package repositories, or modifying all test environments.
The test matrix lets a single run cover every combination. For example, one run can test different VM sizes on Azure, different images, or different VM sizes and different images together. Anything that can be parameterized can be tested in a matrix.
Customizable notifiers save test results and files to any type of storage or database.
Agentless and low dependency
LISA operates test systems via SSH without requiring additional dependencies, ensuring compatibility with any system that supports SSH. Although some test cases require installing extra dependencies, LISA itself does not. This allows LISA to perform tests on systems with limited resources or even different operating systems. For instance, LISA can run on Linux, FreeBSD, Windows, and ESXi.
Getting Started with LISA
Ready to dive in? Visit the LISA project at aka.ms/lisa to access the documentation.
Install: Follow the installation guide provided in the repository to set up LISA in your testing environment.
Run: Follow the instructions to run LISA on a local machine, Azure, or existing systems.
Extend: Follow the documentation to extend LISA with test cases, data sources, tools, platforms, workflows, and more.
Join the Community: Engage with other users and contributors through forums and discussions to share experiences and best practices.
Contribute: Modify existing test cases or create new ones to suit your needs. Share your contributions with the community to enhance LISA's capabilities.
Conclusion
LISA offers open-source collaborative testing solutions designed to operate across diverse environments and scenarios, effectively narrowing the gap between enterprise demands and community-led innovation. By leveraging LISA, customers can ensure their Linux deployments are reliable and optimized for performance. Its comprehensive testing capabilities, combined with the flexibility and support of an active community, make LISA an indispensable tool for anyone involved in Linux quality assurance and testing. Your feedback is invaluable, and we would greatly appreciate your insights.
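To make the test matrix idea from the "How it works" section more concrete, here is a minimal Python sketch of expanding parameters into every combination, in the way a single run might cover different VM sizes and images together. It illustrates the general technique (a Cartesian product) only and does not use LISA's runbook format or API. The function name expand_matrix, the parameter names vm_size and image, and the sample values are all hypothetical.

```python
from itertools import product


def expand_matrix(parameters):
    """Return one dict per combination of the given parameter values.

    `parameters` maps a parameter name to the list of values to test.
    This is a hypothetical helper, not part of LISA.
    """
    names = list(parameters)
    combos = product(*(parameters[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]


# Hypothetical sample values, for illustration only.
matrix = expand_matrix({
    "vm_size": ["size-small", "size-large"],
    "image": ["image-a", "image-b", "image-c"],
})

# Each entry describes one environment that a run would need to cover.
for case in matrix:
    print(case)
```

With two VM sizes and three images, the sketch yields six combinations. Adding a third parameter multiplies the count again, which is why expressing the matrix once and letting the tool expand it is less error-prone than listing every case by hand.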