red hat

33 Topics

Applying Site Reliability Engineering to Autonomous AI Agents
If you practice SRE, you already have a mental model for running reliable production systems. You define SLOs. You track error budgets. You use circuit breakers to stop cascading failures. You run chaos experiments to find weaknesses before customers do. You treat every operational decision as a tradeoff between reliability and velocity. That mental model transfers directly to AI agents. It just needs four new ideas. In the Agent Governance Toolkit: Architecture Deep Dive, Policy Engines, Trust, and SRE for AI Agents, we covered Agent SRE briefly as one of AGT's nine packages: SLOs, error budgets, circuit breakers, chaos engineering, and progressive delivery, adapted from the patterns your SRE team already applies to microservices. Several teams asked for the full story. This is it. Agent SRE is one of the more novel parts of the toolkit. The policy engine, zero-trust identity, and execution sandboxing have clear analogs in existing security practice. Agent SRE explores newer ground. Established patterns for defining SLOs for AI agent behavior, building chaos experiments for LLM provider failures, or applying error budgets to agent autonomy are still emerging across the industry. We built these capabilities because running agents in production without them is the equivalent of running a fleet of microservices without circuit breakers, health checks, or an on-call runbook. This post is for SRE teams, platform engineers, and anyone responsible for running AI agents in production. You do not need to be an AI specialist. If you know what a burn rate is, you are ready for this. The Problem: Agents Fail in Ways Your Existing SRE Tooling Cannot See When a service fails, your observability stack tells you: latency went up, error rate crossed the SLO threshold, the circuit breaker opened. You page the on-call engineer. They look at traces and find the slow database query. When an AI agent fails, your observability stack is silent. The agent returned HTTP 200. Latency was normal. Error rate was zero. But the agent quietly approved a transaction it was not authorized to approve, hallucinated a database path and wrote to the wrong table, or got stuck in a reasoning loop that consumed $800 of LLM API budget before anyone noticed. These are not infrastructure failures. They are behavioral failures. And they are invisible to monitoring tools built for stateless, deterministic services, because those tools only watch for crashes and timeouts. They do not watch for wrong behavior. This gap is the problem Agent SRE was designed to solve. The solution borrows everything from the SRE playbook and adds one concept that extends it: the Safety SLI. The Safety SLI: A New Reliability Dimension Traditional SLIs measure system behavior from the user's perspective: latency, availability, error rate, throughput. They answer: did the service respond correctly? For AI agents, correctness is not enough. An agent that responds correctly but acts outside its authorized scope has not succeeded. It has failed in a way that none of your existing SLIs can detect. The Safety SLI answers a different question: did the agent act within policy? from agent_sre import SLO, ErrorBudget from agent_sre.slo.indicators import PolicyCompliance # Define a safety SLO: 99% of agent actions must comply with policy safety_slo = SLO( name="safety-compliance", indicators=[ PolicyCompliance( target=0.99, window="7d", ), ], error_budget=ErrorBudget( total=0.01, # 1% budget (1 - 0.99 target) window_seconds=2592000, # 30-day window burn_rate_alert=2.0, # warn at 2x sustainable rate burn_rate_critical=5.0, # page at 5x sustainable rate ), ) When an agent's policy compliance rate drops below 99%, the error budget starts burning. The ErrorBudget tracks consumption automatically and exposes burn rate alerts through its firing_alerts() method. When the budget is exhausted, the configured exhaustion_action determines the system response: from agent_sre.slo.objectives import ExhaustionAction # Configure what happens when error budget is exhausted safety_slo = SLO( name="safety-compliance", indicators=[PolicyCompliance(target=0.99, window="7d")], error_budget=ErrorBudget( total=0.01, window_seconds=2592000, burn_rate_alert=2.0, # fires at 2x sustainable burn rate burn_rate_critical=5.0, # fires at 5x sustainable burn rate exhaustion_action=ExhaustionAction.CIRCUIT_BREAK, # suspend agent when budget is gone ), ) # In your monitoring loop, check for firing alerts alerts = safety_slo.error_budget.firing_alerts() for alert in alerts: print(f"Alert firing: {alert.name} (severity: {alert.severity})") # Check budget status print(f"Budget remaining: {safety_slo.error_budget.remaining_percent:.1f}%") print(f"Current burn rate: {safety_slo.error_budget.burn_rate():.2f}x") print(f"Exhausted: {safety_slo.error_budget.is_exhausted}") This is the governance dial from the other direction. The error budget is not just a metric: it is the mechanism that drives agent autonomy decisions. An agent with a clean 30-day safety record earns autonomy. An agent whose budget is burning at 5x the sustainable rate triggers a critical alert, and when the budget is exhausted, the exhaustion_action fires: ALERT, THROTTLE, FREEZE_DEPLOYMENTS, or CIRCUIT_BREAK. The graduated response mirrors what SRE teams already do with service SLOs, applied to agent behavior. There are multiple SLI dimensions built into Agent SRE. Safety SLIs and Performance SLIs track different aspects of the same agent: SLI Type What It Measures Target Pattern When Budget Burns Safety SLI PolicyCompliance -- fraction of actions within authorized scope >= 99% Restrict capabilities, increase human oversight Performance SLI TaskSuccessRate, ResponseLatency, CostPerTask Configurable per workload Alert, throttle, or circuit-break LLM provider Additional built-in indicators include ToolCallAccuracy, DelegationChainDepth, HallucinationRate, and CalibrationDeltaSLI. Both SLOs feed into the same error budget dashboard. An agent can have excellent performance but a degrading safety record, or perfect safety compliance and terrible cost efficiency. You need both dimensions to understand whether an agent is production-ready. Circuit Breakers: Governing Agent Failure Modes That Don't Exist in Microservices Circuit breakers for services protect against one failure mode: a backend that is slow or unreachable. The pattern is CLOSED -> OPEN -> HALF_OPEN. You know it well. Agent SRE implements the same state machine for failure modes that are specific to autonomous reasoning systems and do not exist in traditional microservice architectures: from agent_sre.cascade.circuit_breaker import CircuitBreakerConfig, CircuitBreaker from agent_sre.chaos.engine import FaultType config = CircuitBreakerConfig( failure_threshold=5, # Open after 5 failures in the window recovery_timeout_seconds=60, # Stay OPEN for 60s before HALF_OPEN half_open_max_calls=3, # Allow 3 probes in HALF_OPEN ) breaker = CircuitBreaker(agent_id="analyst-agent-001", config=config) # Failure modes tracked by the circuit breaker: tracked_faults = [ FaultType.POLICY_BYPASS, # Agent exceeds authorized scope FaultType.ERROR_INJECTION, # Upstream model API fails FaultType.TIMEOUT_INJECTION, # Tool calls exceed time budget FaultType.TRUST_PERTURBATION, # Agent trust score falls below threshold FaultType.DEADLOCK_INJECTION, # Agent stuck in iterative reasoning ] Each failure mode has different circuit-breaking semantics: Failure Mode What Triggers It Circuit-Break Behavior Policy bypass Action denied by policy engine Count toward threshold; log with full context LLM provider error HTTP 5xx from model API Immediately open; route to fallback model if configured Tool timeout Tool call exceeds timeout_ms Count toward threshold; cancel in-flight call Trust score degradation Agent trust score drops below configured floor Open; escalate to Ring 3 (untrusted) until score recovers Reasoning loop / deadlock Token or iteration count exceeds budget Open; trigger human review before resuming The reasoning loop breaker deserves attention. A microservice cannot get stuck reasoning. An AI agent absolutely can, and when it does, the failure is not an error code: it is an agent that keeps calling tools, consuming tokens, and generating audit events indefinitely. The circuit breaker detects this pattern from the iteration count and token budget and terminates the loop: # Reasoning loop detection configuration loop_detection_config = { "max_iterations": 15, # Hard stop after 15 reasoning steps "max_tokens_per_session": 50000, # Hard stop on token consumption "repetition_threshold": 0.85, # Stop if >85% of recent actions repeat prior ones "on_detection": "circuit_break_and_escalate", } The state machine behaves identically to what you know from Hystrix or Resilience4j. What changes is the definition of "failure." CLOSED (serving) | | failure_threshold crossed for any tracked fault v OPEN (rejecting -- agent action denied, fallback or human-in-loop fires) | | recovery_timeout expires v HALF_OPEN (probe -- limited requests allowed through) | |-- success_threshold met --> CLOSED |-- any failure --> OPEN (reset timeout) Chaos Engineering for Agents: Fault Injection for Autonomous Systems The only way to know if your agent system is resilient is to break it intentionally. Traditional chaos engineering targets infrastructure: kill a pod, inject network latency, saturate a disk. Agent chaos engineering targets the failure modes specific to autonomous reasoning systems. Agent SRE ships fault injection templates that cover the failure modes teams consistently underestimate until they hit production: from agent_sre.chaos.engine import ChaosExperiment, Fault, FaultType # Experiment 1: LLM provider degrades -- model returns valid responses but with # increased latency and occasional malformed outputs experiment = ChaosExperiment( name="llm-degradation-resilience", target_agent="analyst-agent-001", description="Test agent behavior under degraded LLM provider", faults=[ Fault.latency_injection(target="llm-provider", delay_ms=8000), Fault.error_injection(target="llm-provider", rate=0.05), ], duration_seconds=300, ) # Experiment 2: Trust score manipulation -- simulates an agent receiving # messages from a peer with a spoofed trust score trust_experiment = ChaosExperiment( name="trust-manipulation-resilience", target_agent="orchestrator-001", faults=[ Fault( fault_type=FaultType.TRUST_PERTURBATION, target="did:mesh:orchestrator-001", params={"spoofed_score": 950}, ), ], duration_seconds=120, ) # Experiment 3: Tool timeout cascade -- multiple tools time out simultaneously, # testing whether the agent abandons gracefully or enters a reasoning loop cascade_experiment = ChaosExperiment( name="tool-timeout-cascade", target_agent="analyst-agent-001", faults=[ Fault.timeout_injection(target="database.read", delay_ms=30000), Fault.timeout_injection(target="api.call", delay_ms=30000), ], duration_seconds=180, ) # Run the experiment experiment.start() # ... inject faults during agent execution ... resilience = experiment.calculate_resilience( baseline_success_rate=0.95, experiment_success_rate=0.87, recovery_time_ms=48000, ) experiment.complete(resilience=resilience) print(f"Resilience score: {resilience.overall}/100 -- {'PASSED' if resilience.passed else 'FAILED'}") Additional fault types built into the chaos engine cover: prompt injection attempts, privilege escalation, data exfiltration attempts, identity spoofing, deadlock injection, and contradictory instruction scenarios. Each maps to a FaultType enum value and can be composed into multi-fault experiments. Important: The chaos engine records that a fault was injected and triggers the governance response pipeline. Actual infrastructure-level fault injection (network partition, process kill) should be implemented using your existing chaos tooling (Chaos Mesh, Gremlin, Azure Chaos Studio, or similar). Agent SRE governs the agent's behavioral response to faults; it does not own infrastructure manipulation. These two layers are designed to compose. Each chaos experiment produces a structured resilience score via calculate_resilience(), which compares baseline and experiment success rates. A score of 90+ with passed=True means the agent maintained at least 90% of its baseline performance under fault conditions. Teams use this to set minimum resilience thresholds for production readiness. Replay Debugging: Reproduce Behavioral Failures Exactly Infrastructure incidents are reproducible because infrastructure is deterministic. AI agent incidents are hard to reproduce because agent behavior depends on model state, context window content, and the sequence of tool call results, none of which are preserved by default after a session ends. Agent SRE's replay engine records every agent session as a replayable artifact: the full trace at each step, every tool call with its inputs and outputs, every policy evaluation with its decision, and every trust score at the time of each inter-agent message. from agent_sre.replay.capture import TraceStore from agent_sre.replay.engine import ReplayEngine, ReplayMode # Traces are captured automatically when SRE tracing is active store = TraceStore( backend="azure_blob", retention_days=30, ) # When an incident occurs, replay the session exactly engine = ReplayEngine(store=store) # Full replay: re-run the session against the same recorded inputs # Uses recorded tool outputs -- no live tool calls -- so replay is deterministic result = await engine.replay( trace_id="trace_2026_05_a7f3b2", mode=ReplayMode.FULL, ) for step in result.steps: print(f"Step {step.index}: {step.action} -> {step.decision}") # Divergence analysis: replay with a policy change applied # Shows exactly which actions would have been blocked under the new policy diff_result = await engine.diff( trace_id="trace_2026_05_a7f3b2", policy_override="policies/stricter-v2.yaml", ) for diff in diff_result.diffs: if diff.description: print(f"Step {diff.span_name}: was {diff.original}, " f"would be {diff.replayed} under new policy") The divergence analysis is the feature teams use most. When a policy change is proposed, you replay recent production traces against the new policy to see how many actions would have been blocked, which sessions would have failed, and what the error budget impact would have been. Policy changes stop being guesswork. Progressive Delivery: Safely Rolling Out New Agent Capabilities When you ship a new service version, you do not send it to all traffic at once. You use canary deployments, feature flags, or traffic splitting. You watch the SLOs. If they degrade, you roll back. Agent SRE brings the same discipline to agent capability rollout. When you expand an agent's authorized scope, giving it write access it did not have, connecting it to a new tool, or raising its trust floor, you do not expand to the full fleet immediately. You expand progressively, with automated SLO gates controlling each stage. from agent_sre.delivery.rollout import ( AnalysisCriterion, CanaryRollout, RollbackCondition, RolloutStep, ) rollout = CanaryRollout( name="database-write-capability", steps=[ RolloutStep( name="canary", weight=0.05, # 5% of agents get the new capability duration_seconds=86400, # 24 hours analysis=[ AnalysisCriterion(metric="safety_sli", threshold=0.995), AnalysisCriterion(metric="performance_sli", threshold=0.90), AnalysisCriterion( metric="error_budget_consumed", threshold=0.10, comparator="lte", # canary can burn at most 10% ), ], ), RolloutStep( name="early-adopters", weight=0.25, # 25% traffic duration_seconds=172800, # 48 hours analysis=[ AnalysisCriterion(metric="safety_sli", threshold=0.990), AnalysisCriterion(metric="performance_sli", threshold=0.88), ], ), RolloutStep( name="general-availability", weight=1.0, # 100% traffic duration_seconds=604800, # 1 week of full observation analysis=[ AnalysisCriterion(metric="safety_sli", threshold=0.990), AnalysisCriterion(metric="performance_sli", threshold=0.85), ], ), ], rollback_conditions=[ RollbackCondition(metric="safety_sli", threshold=0.95, comparator="lte"), ], ) # Start the rollout -- SLO gates evaluate at each step rollout.start() # Advance to next step when analysis criteria pass if rollout.advance(): print(f"Advanced to step: {rollout.current_step.name}") print(f"Progress: {rollout.progress_percent:.0f}%") The SLO gate at each step is the same mechanism as a CI/CD quality gate, but measured on live production behavior rather than test results. An agent capability that degrades the safety SLI during canary does not promote to the next step. If a RollbackCondition fires, the rollout rolls back automatically. This is the mechanism that makes it operationally safe to expand agent autonomy: every expansion is measurable, every measurement gates the next expansion, and rollback is automatic. Health Checks and Backpressure Traditional health checks answer: is the service alive? For agents, alive is not enough. A healthy agent is one that is alive, operating within policy, consuming resources within budget, and maintaining a trust score above the Ring threshold it was assigned. # Agent health check covering multiple dimensions health = await agent_health_check( agent_id="analyst-agent-001", dimensions=[ "liveness", # Is the agent process running? "policy_compliance", # Is safety SLI above threshold? "trust_score", # Is trust score above Ring floor? "resource_budget", # Is token/API spend within limits? "tool_availability", # Are the tools the agent needs reachable? ], ) # health.status: "healthy" | "degraded" | "unhealthy" # health.dimensions: per-dimension pass/fail with values # health.recommended_action: "none" | "restrict" | "suspend" | "terminate" When health checks report degradation, backpressure controls engage before the circuit breaker opens. Backpressure is the earlier, softer response: accept fewer concurrent tasks, reject low-priority work, drain in-flight tasks gracefully before the situation escalates. # Backpressure configuration backpressure_config = { "backpressure_threshold": 0.80, # Engage when resource utilization > 80% "max_concurrent": 5, # Hard cap on simultaneous agent tasks "priority_shedding": True, # Drop low-priority tasks first "drain_timeout_seconds": 30, # Allow in-flight tasks to complete } The ordering matters: backpressure first, then circuit breaker, then suspension. Each stage is recoverable. Each stage preserves more agent state than the next. The SRE principle of graduated response applies to agents exactly as it applies to services. Observability: Governance Metrics Flow Into Your Existing Stack Agent SRE does not ask you to adopt a new observability platform. Governance metrics are exported through the same adapters your infrastructure monitoring already uses, including OpenTelemetry, Prometheus, Datadog, and others. from agent_sre.tracing.exporters import configure_exporters configure_exporters( backends=[ {"type": "prometheus", "endpoint": "http://prometheus:9090"}, {"type": "opentelemetry", "endpoint": "http://otel-collector:4317"}, ], include_metrics=[ "slo.safety_sli", # Per-agent safety compliance rate "slo.error_budget_remaining", # Error budget in percentage "slo.burn_rate", # Current burn rate vs sustainable "circuit_breaker.state", # CLOSED / OPEN / HALF_OPEN "circuit_breaker.failure_count", "trust_score.current", # Agent trust score (0-1000) "trust_score.ring", # Current execution ring "chaos.experiments_run", # Chaos experiment telemetry "health.status", # Aggregate health status "backpressure.load", # Current load vs threshold ], ) Key governance metrics available in your existing dashboards: Metric What It Tells You Alert Condition slo.safety_sli Fraction of agent actions within policy < 0.99 slo.burn_rate Rate at which error budget is consumed > 2.0 (warn), > 5.0 (page) slo.error_budget_remaining Budget left for the SLO window < 20% circuit_breaker.state Current breaker state per agent OPEN or HALF_OPEN trust_score.ring Execution ring (privilege level) Ring 3 (untrusted) health.status Aggregate health across all dimensions degraded or unhealthy If you are already running Grafana dashboards for your services, a governance dashboard for your agent fleet is a new data source and a new set of panels, not a new monitoring stack. The SRE Mental Model for Agents: Four New Concepts Everything in Agent SRE is built on the SRE mental model you already have, extended with four concepts that adapt traditional reliability thinking for autonomous systems: Traditional SRE Agent SRE Equivalent What Changes Latency SLI Safety SLI Correctness of *action*, not speed of *response* Error budget Autonomy budget Burns on policy violations, not just errors Circuit breaker Behavioral circuit breaker Opens on wrong *behavior*, not just failure codes Canary deployment Capability rollout Rolls out *scope*, not just code The governance insight is that error budgets work in both directions for agents. A service's error budget only decreases. An agent's autonomy is also a budget: it grows when the safety SLI is strong and shrinks when it degrades. The error budget mechanism becomes the operational mechanism for expanding and contracting agent autonomy in response to evidence, which is exactly what regulated industries and risk-averse enterprise teams need before they will trust an autonomous agent with consequential actions. Getting Started with Agent SRE pip install agent-sre A minimal Agent SRE integration requires three things: a safety SLO definition, a circuit breaker, and a health check. The progressive delivery and chaos engineering features layer on top when you are ready for them. from agent_sre import SLO, ErrorBudget from agent_sre.slo.indicators import TaskSuccessRate from agent_sre.cascade.circuit_breaker import CircuitBreakerConfig, CircuitBreaker # Step 1: Define your safety SLO slo = SLO( name="production-safety", indicators=[TaskSuccessRate(target=0.99, window="24h")], error_budget=ErrorBudget(total=0.01, burn_rate_alert=2.0, burn_rate_critical=5.0), ) # Step 2: Configure a circuit breaker breaker_config = CircuitBreakerConfig( failure_threshold=5, recovery_timeout_seconds=60, half_open_max_calls=3, ) breaker = CircuitBreaker(agent_id="my-agent", config=breaker_config) # Step 3: Wire into your existing agent loop async def governed_agent_loop(agent, task): # Check health first if not await agent_is_healthy(agent.id): return {"error": "agent suspended", "reason": "health check failed"} # Run within circuit breaker protection async with breaker: result = await agent.run(task) slo.record_event(good=result.policy_compliant) return result The quickstart in the repository walks through a complete setup with safety SLOs, circuit breakers, and a Prometheus dashboard export in under 50 lines. Why This Matters Most AI observability tools today focus on what you might call model quality: hallucination rate, latency, token cost, task completion. These are useful metrics. They are not SRE metrics. They do not answer whether the agent acted within its authorized scope, whether its behavioral error budget is burning at a dangerous rate, or whether it would survive the LLM provider going down. Agent SRE answers those questions using the operational vocabulary that SRE teams already understand: SLOs, error budgets, circuit breakers, chaos experiments, and health checks. The goal is not to replace your observability stack. It is to make agent governance visible inside it. The reliability of an autonomous agent is not a property of the model. It is a property of the governance infrastructure around it. Agent SRE is that infrastructure. Resources GitHub: github.com/microsoft/agent-governance-toolkit Install: pip install agent-sre Tutorials: 40+ tutorials including dedicated Agent SRE walkthroughs for SLO setup, chaos experiments, and progressive delivery Architecture reference: ARCHITECTURE.md OWASP compliance mapping: OWASP-COMPLIANCE.md -- Agent SRE addresses ASI-08 (Cascading Failures) directly through circuit breakers and SLO-based fault detection Part 1 -- Runtime governance: Policy engines, trust, and SRE overview Part 2 -- Shift-left governance: Catching violations before production Part 3 -- Post-hoc accountability: After the agent acts The Agent Governance Toolkit is an open-source project released under the MIT License. All features described in this post are available in the public repository. The `agent-sre` package is currently in public preview; APIs may change before general availability. Questions about Agent SRE in your environment? Open an issue at aka.ms/agent-governance-toolkit or start a discussion in the comments below.
mosiddi
May 19, 2026 Place Linux and Open Source Blog
40Views
0likes
0Comments
After the Agent Acts: Proving What Happened and Who Authorized It
In part one of this series, we covered AGT's runtime governance: the policy engine, zero-trust identity, execution sandboxing, and the OWASP Agentic AI risk mapping. In part two, we moved earlier in the lifecycle: shift-left governance, CI/CD gates, attestation workflows, and supply chain integrity. Both posts focused on governance that happens around the moment of action, before it, during it, or right after it. That coverage is essential. But after those posts went live, a different pattern emerged in conversations with teams deploying agents in production. The question was more pointed: "An agent executed a financial transfer last Tuesday. A compliance officer is asking us to show who authorized it, through what chain, and exactly what scope it was granted. We have logs. But can we prove they weren't altered?" No policy engine prevents a past action. No CI gate reconstructs a delegation chain after the fact. No shift-left tool tells an auditor whether the cryptographic identity that authorized a trade was legitimately derived from a human principal, or was injected mid-chain. This is the accountability gap. It is the governance question that neither runtime enforcement nor pre-runtime checks were designed to answer. Regulatory frameworks are tightening: the EU AI Act includes high-risk obligations with enforcement timelines in 2026, and the Colorado AI Act introduces requirements for automated decision-making. Courts are beginning to encounter AI agents in the evidentiary record. The accountability infrastructure has not caught up. This post covers what post-hoc accountability means for autonomous agents, what the Agent Governance Toolkit has to help address it, and three value propositions that are real but not yet visible in how governance tooling is typically described. Note: The policy files, workflow configurations, and code samples in this post are illustrative examples designed to show the concepts. For working implementations, see the QUICKSTART.md in the repository. The Accountability Gap in Multi-Agent Systems The accountability problem is architectural. When a single agent takes a single action, accountability is straightforward: you know which model ran, what prompt it received, and what it called. When agents delegate to sub-agents, which delegate further to tool-execution agents, the chain of authorization becomes progressively disconnected from the original human instruction that started it. Consider this delegation topology, common in any production orchestration scenario: Human Principal └── Orchestrator Agent (did:mesh:orchestrator-001) └── Data Analyst Agent (did:mesh:analyst-001) └── File Write Tool (write /reports/q3-summary.csv) By the time file_write fires, three delegation hops have occurred. The file write tool has no reliable way to know whether the human principal actually authorized file writes, what scope they granted to the orchestrator, or whether the analyst agent's instructions arrived through a legitimate delegation or were injected by a prompt injection attack. This gap has three concrete consequences: Consequence Operational Impact Post-hoc audits cannot reconstruct authorization Incident investigations are limited to "the agent did this," not "here is who authorized this, through what chain, at what time, with what scope" Agents cannot distinguish legitimate delegation from injection A prompt injection attack that inserts itself into a delegation chain is indistinguishable from a real orchestrator instruction without cryptographic verification Accountability cannot be attributed to a human authorization event When a regulator asks "who is responsible for this action," the answer is a shrug and a log file AGT already has the technical foundations designed to help close all three. The gap is not capability, it is visibility. What AGT Has: The Cryptographic Accountability Stack AGT's accountability infrastructure spans three components that work together: cryptographic agent identity, delegation chains, and tamper-evident audit logs. 1. Ed25519 Agent Identity with Lifecycle Management Every agent in an AGT-governed system carries a cryptographic identity: a verifiable Ed25519 keypair with a W3C DID Document that can be exported, shared, and verified by any participant in the system. from agentmesh import AgentIdentity, IdentityRegistry # Create a verifiable agent identity identity = AgentIdentity.create( name="data-analyst", sponsor="operator@contoso.com", capabilities=["data.read", "report.write"], organization="data-team", description="Q3 close data analyst agent" ) # Export as W3C DID Document for cross-system verification did_document = identity.to_did_document() # Register in the shared identity registry registry = IdentityRegistry() registry.register(identity) Identity lifecycle states, active, suspended, revoked, are tracked and cascaded. When an orchestrator identity is revoked, every downstream agent delegated from it is also invalidated. This cascade revocation behavior lets you kill a compromised delegation chain from its root rather than hunting sub-agents individually. 2. Delegation Chains with Scope Inheritance When an orchestrator delegates to a sub-agent, AGT records the delegation cryptographically: who delegated, to whom, what capabilities were transferred, and what restrictions were applied. Sub-agents are designed to be unable to exceed the scope of their delegating principal. from agentmesh import ScopeChain, DelegationLink # Create a scope chain rooted in a human sponsor chain, root_link = ScopeChain.create_root( sponsor_email="operator@contoso.com", root_agent_did=str(orchestrator_identity.did), capabilities=["data.read", "report.write", "data.delete"], sponsor_verified=True, ) # Orchestrator delegates narrowed scope to analyst agent link = DelegationLink( link_id="link-analyst-001", depth=1, parent_did=str(orchestrator_identity.did), child_did=str(analyst_identity.did), parent_capabilities=["data.read", "report.write", "data.delete"], delegated_capabilities=["data.read", "report.write"], # narrowed: no delete parent_signature=orchestrator_identity.sign( f"{orchestrator_identity.did}:{analyst_identity.did}:data.read,report.write".encode() ), link_hash="", # computed on add previous_link_hash=root_link.link_hash, ) link.link_hash = link.compute_hash() chain.add_link(link) # Verify the entire chain: scope narrowing + hash integrity + signatures valid, reason = chain.verify() if not valid: raise ValueError(f"Chain verification failed: {reason}") The scope chain carries the human authorization context: the root sponsor email, when the chain was created, and what capabilities were granted at the top. Every downstream agent can trace any capability back through the chain using chain.trace_capability("data.read"). A file write tool executing three hops from the human principal can verify that the original sponsor authorized file writes in this scope. This is the mechanism designed to help close the prompt injection gap: an injected instruction cannot produce a valid signed delegation link from a legitimate orchestrator identity. 3. Tamper-Evident Audit Logs Every policy decision, every delegation event, every tool call, every trust score evaluation: AGT writes a signed, append-only audit record. The signature covers the content hash of the log entry plus the hash of the preceding entry, forming a chain where tampering is designed to be detectable. from agentmesh import PolicyEngine, AuditLog # Create the audit log (with optional external sink for production) audit_log = AuditLog() # Log a governance decision entry = audit_log.log( event_type="policy_decision", agent_did=str(analyst_identity.did), action="report.write", resource="/reports/q3-summary.csv", data={"task_id": "q3-close-2026"}, outcome="success", policy_decision="allow", ) # Verify the audit chain has not been tampered with valid, reason = audit_log.verify_chain() # valid == True: all hashes and chain links are intact # Query audit trail for a specific agent trail = audit_log.get_entries_for_agent(str(analyst_identity.did)) The audit trail for a single task session includes the complete delegation chain, from human authorization event at the top to tool execution at the bottom, with cryptographic signatures at every step. Validating a Compliance Evidence Package The three components above are most powerful when used together. At runtime, AGT's audit chain, identity registry, and delegation system each produce structured records. Assembling these into a single evidence package for compliance submission or incident investigation is a deployment-level concern: your CI pipeline or orchestration layer collects the outputs into a JSON artifact. Once assembled, AGT's agt verify --evidence flag validates the package: checking that signatures are intact, delegation chains are complete, and audit entries have not been tampered with. # Validate a runtime evidence package agt verify --evidence ./agt-evidence.json # Strict mode: fail if evidence is missing, incomplete, or signatures don't verify agt verify --evidence ./agt-evidence.json --strict Future direction: A built-in agt evidence collect command to automate evidence assembly is on the backlog. The evidence package helps answer the audit questions directly: Auditor Question Where It Lives in the Evidence Package Which agent executed this action? identity.agent_id with Ed25519 public key Who authorized it? delegation_chain[0].human_principal with timestamp What scope was granted? delegation_chain[*].granted_capabilities at each hop Was the delegation legitimate? delegation_chain[*].signature, verifiable against issuer's public key Was the audit log altered? audit_trail.chain_valid: true/false with entry-level hash verification What policy governed the action? policy_decision.rule_name with the policy YAML snapshot at decision time This is the difference between "we have logs" and "here is a verifiable chain of custody backed by cryptographic signatures." The Governance Dial: Enabling Autonomy, Not Just Blocking Risk There is a framing problem in how agent governance is typically described. Governance is described almost entirely as a constraint: what agents cannot do, what gets blocked, what violations get caught. This framing is accurate but incomplete. Governance is the mechanism that helps you safely expand what your agents can do. Without governance evidence, every expansion of agent autonomy is a leap of faith. With it, expansions are decisions with a measured risk profile: Scenario Without Governance Evidence With AGT Accountability Stack Expand agent to write to production databases Requires human approval on every write indefinitely Pilot with human-in-loop for 500 writes; audit trail shows 0 violations; graduate to autonomous Deploy agent in a regulated data environment Blocked by legal until "we can prove it" Evidence package helps satisfy audit requirement; deployment proceeds Respond to a security incident involving an agent Manually reconstruct what happened from scattered logs Pull the task session's evidence package; full chain of custody in minutes The governance layer is the dial between supervised and autonomous operation. Audit evidence is what helps justify turning the dial further in the autonomous direction. Blast Radius: The Governance Assurance You're Not Advertising The sandboxing and privilege ring system in AGT is typically described in security terms: isolation, privilege reduction, process-level enforcement. But there is a more concrete operational value: blast radius definition before an incident occurs. The question every operations team needs to answer before deploying an autonomous agent at scale is: *"If this agent goes wrong, not if, when, what is the worst-case outcome?"* Without governance-enforced privilege boundaries, the answer is uncomfortably open-ended. With AGT's capability model and execution rings, the blast radius is a policy configuration: a bounded, declared set of resources the agent can touch, scoped to what the task requires. # policies/financial-agent.yaml apiVersion: governance.toolkit/v1 version: "1.0" name: financial-agent-policy default_action: deny rules: - name: allow-report-write condition: "tool_name == 'report.write' and path.startswith('/data/reports/')" action: allow priority: 10 - name: allow-data-read condition: "tool_name == 'data.read' and path.startswith('/data/processed/')" action: allow priority: 10 With this policy in place, the worst-case outcome for this agent is declared in the policy file, not discovered during a post-incident review. The audit log records not just what the agent did, but also every action that was blocked, giving you a full picture of how close any session came to the declared blast boundary. Regulatory Alignment The OWASP-COMPLIANCE.md in the AGT repository maps the toolkit's controls to each of the 10 OWASP Agentic AI risks. The compliance picture for specific regulatory frameworks: Regulatory Requirement Relevant Framework AGT Control Technical documentation for high-risk AI EU AI Act, Art. 9-11 Evidence package, policy audit trail, OWASP attestation Logging for automated decisions EU AI Act, Art. 12 Tamper-evident audit log with entry-level signatures Human oversight mechanisms EU AI Act, Art. 14 Circuit breakers, privilege rings, delegation scope limits Algorithmic impact assessment Colorado AI Act Policy snapshot at decision time, signed governance evidence Audit trail for automated decisions HIPAA, SOC 2 Type II Immutable audit log with W3C DID-based agent identity Non-repudiation of agent actions Financial services (MiFID II, SEC) Ed25519-signed audit entries, delegation chain with human auth context Note: The Agent Governance Toolkit does not guarantee compliance with any specific regulatory framework. The mappings above show how the toolkit's controls align with common requirements. Consult legal counsel for your specific obligations. Putting It Together The three posts in this series cover three distinct layers of the governance lifecycle: Layer Timing Primary Value Post Shift-left governance Before production Catch policy violations at commit, PR, and CI time Part 2 Runtime governance At the moment of action Deterministic policy enforcement, zero-trust identity, sandboxing Part 1 Post-hoc accountability After the action Cryptographic chain of custody, blast radius evidence, regulatory proof This post None of these layers substitutes for the others. Pre-runtime governance cannot prevent a runtime violation. Runtime enforcement cannot retroactively prove authorization. Post-hoc accountability cannot undo an action that runtime governance should have blocked. They compose. Getting Started If you already have the AGT policy engine in place, the path to full accountability coverage is incremental: Add agent identity - Create identities for each agent and register them. Export DID documents for cross-service verification. Record delegation tokens - At each orchestrator-to-agent delegation boundary, create and sign a delegation link. Pass tokens as context to the policy engine. Configure a tamper-evident audit backend - Configure the audit chain with a signing key and chain verification. For production, use an immutable backend: Azure Blob with WORM retention, S3 Object Lock, or equivalent. Generate your first evidence package: agt verify --evidence ./agt-evidence.json --strict Add evidence generation to your CI/CD release gate: # .github/workflows/release.yml - name: Governance Evidence Gate uses: microsoft/agent-governance-toolkit/action@<sha> #v3.5.0 with: command: governance-verify evidence-path: ./agt-evidence.json strict: true fail-on-missing-chain: true Conclusion Runtime governance and shift-left governance answer the question: did we apply the right controls? Post-hoc accountability answers the question: can we prove it? The Agent Governance Toolkit has the technical infrastructure designed to help answer it: Ed25519 agent identity with cascade revocation, cryptographically signed delegation chains with human authorization context, and tamper-evident audit logs that form a verifiable chain of custody from human principal to terminal tool call. The governance dial analogy is worth keeping. Every autonomous agent deployment exists on a spectrum between fully supervised and fully autonomous. The limiting factor on where you can set that dial is not model capability or framework maturity. It is how much governance evidence you have, and how verifiable that evidence is. Resources GitHub: microsoft/agent-governance-toolkit: AI Agent Governance Toolkit — Policy enforcement, zero-trust identity, execution sandboxing, and reliability engineering for autonomous AI agents. Covers 10/10 OWASP Agentic Top 10. Quickstart: Quick Start - Agent Governance Toolkit OWASP Compliance Mapping: OWASP Compliance - Agent Governance Toolkit PyPI: pip install agent-governance-toolkit[full] npm: npm install microsoft/agent-governance-sdk NuGet: dotnet add package Microsoft.AgentGovernance Have questions about deploying AGT in your environment? Open an issue at aka.ms/agent-governance-toolkit or join the conversation in the comments below.
mosiddi
May 14, 2026 Place Linux and Open Source Blog
146Views
0likes
0Comments
From Policy to Practice: Built-In CIS Benchmarks on Azure - Flexible, Hybrid-Ready
Security is more important than ever. The industry-standard for secure machine configuration is the Center for Internet Security (CIS) Benchmarks. These benchmarks provide consensus-based prescriptive guidance to help organizations harden diverse systems, reduce risk, and streamline compliance with major regulatory frameworks and industry standards like NIST, HIPAA, and PCI DSS. In our previous post, we outlined our plans to improve the Linux server compliance and hardening experience on Azure and shared a vision for integrating CIS Benchmarks. Today, that vision has turned into reality. We're now announcing the next phase of this work: Center for Internet Security (CIS) Benchmarks are now available on Azure for all Azure endorsed distros, at no additional cost to Azure and Azure Arc customers. With today's announcement, you get access to the CIS Benchmarks on Azure with full parity to what’s published by the Center for Internet Security (CIS). You can adjust parameters or define exceptions, tailoring security to your needs and applying consistent controls across cloud, hybrid, and on-premises environments - without having to implement every control manually. Thanks to this flexible architecture, you can truly manage compliance as code. How we achieve parity To ensure accuracy and trust, we rely on and ingest CIS machine-readable Benchmark content (OVAL/XCCDF files) as the source of truth. This guarantees that the controls and rules you apply in Azure match the official CIS specifications, reducing drift and ensuring compliance confidence. What’s new under the hood At the core of this update is azure-osconfig’s new compliance engine - a lightweight, open-source module developed by the Azure Core Linux team. It evaluates Linux systems directly against industry-standard benchmarks like CIS, supporting both audit and, in the future, auto-remediation. This enables accurate, scalable compliance checks across large Linux fleets. Here you can read more about azure-osconfig. Dynamic rule evaluation The new compliance engine supports simple fact-checking operations, evaluation of logic operations on them (e.g., anyOf, allOf) and Lua based scripting, which allows to express complex checks required by the CIS Critical Security Controls - all evaluated natively without external scripts. Scalable architecture for large fleets When the assignment is created, the Azure control plane instructs the machine to pull the latest Policy package via the Machine Configuration agent. Azure-osconfig’s compliance engine is integrated as a light-weight library to the package and called by Machine Configuration agent for evaluation – which happens every 15-30minutes. This ensures near real-time compliance state without overwhelming resources and enables consistent evaluation across thousands of VMs and Azure Arc-enabled servers. Future-ready for remediation and enforcement While the Public Preview starts with audit-only mode, the roadmap includes per-rule remediation and enforcement using technologies like eBPF for kernel-level controls. This will allow proactive prevention of configuration drift and runtime hardening at scale. Please reach out if you interested in auto-remediation or enforcement. Extensibility beyond CIS Benchmarks The architecture was designed to support other security and compliance standards as well and isn’t limited to CIS Benchmarks. The compliance engine is modular, and we plan to extend the platform with STIG and other relevant industry benchmarks. This positions Azure as a platform for a place where you can manage your compliance from a single control-plane without duplicating efforts elsewhere. Collaboration with the CIS This milestone reflects a close collaboration between Microsoft and the CIS to bring industry-standard security guidance into Azure as a built-in capability. Our shared goal is to make cloud-native compliance practical and consistent, while giving customers the flexibility to meet their unique requirements. We are committed to continuously supporting new Benchmark releases, expanding coverage with new distributions and easing adoption through built-in workflows, such as moving from your current Benchmark version to a new version while preserving your custom configurations. Certification and trust We can proudly announce that azure-osconfig has met all the requirements and is officially certified by the CIS for Benchmark assessment, so you can trust compliance results as authoritative. Minor benchmark updates will be applied automatically, while major version will be released separately. We will include workflows to help migrate customizations seamlessly across versions. Key Highlights Built-in CIS Benchmarks for Azure Endorsed Linux distributions Full parity with official CIS Benchmarks content and certified by the CIS for Benchmark Assessment Flexible configuration: adjust parameters, define exceptions, tune severity Hybrid support: enforce the same baseline across Azure, on-prem, and multi-cloud with Azure Arc Reporting format in CIS tooling style Supported use cases Certified CIS Benchmarks for all Azure Endorsed Distros - Audit only (L1/L2 server profiles) Hybrid / On-premises and other cloud machines with Azure Arc for the supported distros Compliance as Code (example via Github -> Azure OIDC auth and API integration) Compatible with GuestConfig workbook What’s next? Our next mission is to bring the previously announced auto-remediation capability into this experience, expand the distribution coverage and elevate our workflows even further. We’re focused on empowering you to resolve issues while honoring the unique operational complexity of your environments. Stay tuned! Get Started Documentation link for this capability Enable CIS Benchmarks in Machine Configuration and select the “Official Center for Internet Security (CIS) Benchmarks for Linux Workloads” then select the distributions for your assignment, and customize as needed. In case if you want any additional distribution supported or have any feedback for azure-osconfig – please open an Azure support case or a Github issue here Relevant Ignite 2025 session: Hybrid workload compliance from policy to practice on Azure Connect with us at Ignite Meet the Linux team and stop by the Linux on Azure booth to see these innovations in action: Session Type Session Code Session Name Date/Time (PST) Theatre THR 712 Hybrid workload compliance from policy to practice on Azure Tue, Nov 18/ 3:15 PM – 3:45 PM Breakout BRK 143 Optimizing performance, deployments, and security for Linux on Azure Thu, Nov 20/ 1:00 PM – 1:45 PM Breakout BRK 144 Build, modernize, and secure AKS workloads with Azure Linux Wed, Nov 19/ 1:30 PM – 2:15 PM Breakout BRK 104 From VMs and containers to AI apps with Azure Red Hat OpenShift Thu, Nov 20/ 8:30 AM – 9:15 AM Theatre THR 701 From Container to Node: Building Minimal-CVE Solutions with Azure Linux Wed, Nov 19/ 3:30 PM – 4:00 PM Lab Lab 505 Fast track your Linux and PostgreSQL migration with Azure Migrate Tue, Nov 18/ 4:30 PM – 5:45 PM PST Wed, Nov 19/ 3:45 PM – 5:00 PM PST Thu, Nov 20/ 9:00 AM – 10:15 AM PST
pallakatos
Nov 18, 2025 Place Linux and Open Source Blog
1.3KViews
0likes
0Comments
Dalec: Declarative Package and Container Builds
Build once, deploy everywhere. From a single YAML specification, Dalec produces native Linux packages (RPM, DEB) and container images - no Dockerfiles, no complex RPM spec or control files, just declarative configuration. Dalec, a Cloud Native Computing Foundation (CNCF) Sandbox project, is a Docker BuildKit frontend that enables users to build system packages and container images from declarative YAML specifications. As a BuildKit frontend, Dalec integrates directly into the Docker build process, requiring no additional tools beyond Docker itself.
SertacOzercan
Oct 29, 2025 Place Linux and Open Source Blog
542Views
0likes
0Comments
Red Hat Enterprise Linux Software Reservations Now Available
What's New After careful collaboration with Red Hat, we've updated our RHEL pay-as-you-go billing meters and reservation pricing to align with Red Hat's latest pricing model. These updates address previous billing meter issues and ensure an accurate, transparent experience for our customers. Starting today, customers can purchase RHEL software reservations on Azure with updated pricing, allowing organizations to reduce their Linux workload costs with the flexibility and reliability of software reservations. Key Benefits of RHEL Software Reservations Cost savings: Save up to 24% compared to pay-as-you-go pricing by committing to a one-year term Predictable costs: Lock in pricing for your RHEL workloads and optimize your cloud budget Pricing clarity: Updated meters aligned with Red Hat's current pricing model Seamless Azure integration: Manage your RHEL software reservations alongside other Azure resources. RHEL software reservations allow you to pre-purchase RHEL software capacity at a discounted rate, delivering significant savings over standard pay-as-you-go pricing. Get Started Today RHEL software reservations are available now on the Azure portal. To learn more about pricing, terms, and how to purchase them, visit the following pages: Pricing - Linux Virtual Machines | Microsoft Azure Prepay for software plans - Azure Reservations - Azure Virtual Machines | Microsoft Learn Red Hat reservation plan discounts - Azure - Microsoft Cost Management | Microsoft Learn What are Azure Reservations? - Microsoft Cost Management | Microsoft Learn We're committed to providing transparent, reliable billing for all Azure services and appreciate your continued partnership as we deliver the best cloud platform for your open-source workloads. For questions or support, please contact Azure Support or your Microsoft account team.
abbottkarl
Oct 27, 2025 Place Linux and Open Source Blog
885Views
1like
0Comments
The Open-Source Paradox: How Microsoft is giving back
Microsoft is sponsoring All Things Open 2025 to address open-source sustainability by showcasing solutions and fostering community collaboration.
LachlanEvenson
Oct 02, 2025 Place Linux and Open Source Blog
787Views
3likes
0Comments
Red Hat Enterprise Linux Billing Meter ID Updates on Azure
Background Following Red Hat's transition to a scalable vCPU-based pricing model in early 2024, Microsoft Azure implemented corresponding pricing updates for RHEL services. As part of ensuring our billing infrastructure fully supports this new pricing structure, we are updating the billing meter IDs for Red Hat Enterprise Linux services to provide better alignment with the current pricing model. What's Changing As part of optimizing our billing infrastructure to fully support Red Hat's updated pricing model, Microsoft Azure is updating the billing meter IDs for Red Hat Enterprise Linux (RHEL) services. These changes ensure our billing system properly accommodates the vCPU-based pricing structure and Reserved Instance functionality. Effective Date: The meter ID migration will begin rolling out on October 2-8, 2025. Impact on Customers Pay-As-You-Go (PAYG) Customers Billing Continuity: Your RHEL instances will continue to operate without any service interruption Meter ID Changes: New meter IDs will appear in your usage reports and cost management statements starting Oct 2-8, 2025 Usage Tracking: Historical usage data remains accessible through existing billing reports Important: No Pricing Impact This meter ID migration has no impact on pricing. Azure already implemented the RHEL pricing changes in early 2024. This meter ID update is purely a technical billing system change and does not affect your costs in any way. For pricing information, see the Pricing - Red Hat Virtual Machines | Microsoft Azure page. Action Required: Update Your Billing Tools To ensure uninterrupted monitoring and reporting, we recommend reviewing and updating the following: Billing Reports and Dashboards Cost Management Reports: Update any custom reports that filter by specific RHEL meter IDs Power BI Dashboards: Refresh data connections and update meter ID filters Third-party Tools: Verify that cost management tools recognize the new meter IDs Budget Alerts and Monitoring Budget Configurations: Review budgets that target specific RHEL services and update meter ID filters Cost Alerts: Ensure automated cost alerts include both old and new meter IDs during the transition period Spending Thresholds: Update any spending controls that rely on meter ID identification Automated Tools and Scripts Cost Analysis APIs: Update API calls that reference specific RHEL meter IDs Automation Scripts: Modify any scripts that process billing data using meter ID filters Integration Tools: Update third-party integrations that rely on meter ID mapping Transition Period Support During the migration period, you may see both old and new meter IDs in your statements. This is expected behavior as the transition occurs gradually across different regions and services. Timeline: October 2-8, 2025: New meter IDs begin appearing for RHEL Pay as you go meters Transition Period: Both old and new meter IDs may appear in billing during migration Completion: Migration will be completed across all regions and services Affected RHEL Meters The following RHEL meter names will be updated with new meter IDs as part of this migration: 1 vCPU VM License 1 vCPU VM License - Free 2 vCPU VM License 4 vCPU VM License 6 vCPU VM License 8 vCPU VM License 12 vCPU VM License 16 vCPU VM License 20 vCPU VM License 24 vCPU VM License 32 vCPU VM License 40 vCPU VM License 44 vCPU VM License 48 vCPU VM License 52-vCPU VM License 60 vCPU VM License 64 vCPU VM License 72 vCPU VM License 80 vCPU VM License 96 vCPU VM License 120 vCPU VM License 128 vCPU VM License 144-vCPU VM License 192 vCPU VM License 192-vCPU VM License 208 vCPU VM License 416 vCPU VM License 420-vCPU VM License These meter changes are tied to the vCPU count, not the VM family - meaning any VM with the same vCPU count (whether D family, E family, etc.) uses the same RHEL license meter. For example, all 2 vCPU VMs use the "2 vCPU VM License" meter regardless of the VM family. If your billing reports, budgets, or automated tools currently filter or reference any of these specific meter names, you'll need to update them to include the new meter IDs once the migration begins. Support and Resources Need Help? Azure Support: Contact Azure Support for technical assistance with billing or meter ID questions Account Team: Reach out to your Microsoft account team for guidance on updating enterprise billing processes Additional Resources: RHEL pricing update details: Red Hat Enterprise Linux Pricing update Azure cost optimization resources can be found here Frequently Asked Questions Q: Will this affect my RHEL instances or their performance? A: No, this is purely a billing system update. Your RHEL instances will continue to operate normally with no performance impact. Q: How can I identify the new meter IDs in my billing? A: New meter IDs will appear in your Cost Management portal and usage reports and cost management statements starting October 2-8, 2025. Watch for updated RHEL service descriptions in your billing details. Q: What if my automated tools stop working? A: Contact Azure Support for assistance. We recommend updating your tools proactively using the meter ID mapping reference provided. We're committed to making this transition as smooth as possible. If you have questions or need assistance, please reach out to Azure Support or your Microsoft account team.
abbottkarl
Aug 28, 2025 Place Linux and Open Source Blog
889Views
0likes
0Comments
Red Hat Enterprise Linux 10 Image Mode on Azure Quick Start Guide
This guide will get you up and running with Image Mode on RHEL on Azure to help demonstrate how the technology works and how it can save you time and effort in managing your RHEL estate on Azure. To get started, you’ll need the following: Container Registry Linux server for building containers Azure Subscription for deploying a RHEL VM in Image Mode On an existing Linux system with “az” installed (and logged into an Azure account), let’s go ahead and begin by creating a resource group: az group create -l eastus -n linux-vms Now on that same system, let’s generate a ssh key for logging into vms: ssh-keygen -f ~/.ssh/linuxkey -t rsa And hit enter at both passphrase prompts. This will generate an ssh key for the purpose of this quickstart. In production, you may wish to actually use a passphrase to better protect your ssh keys. Now, we need to create a container registry. To do this on Azure, let’s go to https://portal.azure.com/#create/Microsoft.ContainerRegistry and on the first screen, specify an active subscription and choose “linux-vms” as our resource group. For the name of the registry, it must be unique across the Azure namespace, so try something with your name. In my case, my name is Karl and so I’m going to use “rhel10demo” for the purpose of this quick start. Anywhere you see that name, please replace it with your own registry name. Your first page should look similar to this when completed: Once you’ve filled those values in, click “Review + Create” and then on the next screen, click “Create”. Once the deployment finishes, you will have a container registry ready to go. Back over on our Linux system, let’s now deploy to Azure a Linux server for building containers: az vm create \ --resource-group linux-vms \ --name container-builder \ --image RedHat:rhel-raw:9_5:9.5.2024120516 \ --admin-username core \ --assign-identity \ --ssh-key-values ~/.ssh/linuxkey.pub\ --public-ip-sku Standard Once this command has completed, you’ll have a public ip address specified as the value for “publicAddress” in the output from the above command. This document will use ip.ip.ip.ip in place of an actual ip address to show you where you’d place the value. We also need to copy over our linuxkey.pub file to the container building machine so that we can inject this key into the container images we build: scp -i ~/.ssh/linuxkey ~/.ssh/linuxkey.pub core@ip.ip.ip.ip:.ssh/ Now let’s get started building our image mode container image! ssh core@ip.ip.ip.ip -i ~/.ssh/linuxkey First, we want to install podman and git: sudo dnf install -y podman git Now we need to have the Azure CLI tooling so that we can interact with our Azure Container Registry, to do that, let’s run: sudo rpm --import https://packages.microsoft.com/keys/microsoft.asc We now need to edit /etc/yum.repos.d/azure-cli.repo and make sure that it has the following contents: [azure-cli] name=Azure CLI baseurl=https://packages.microsoft.com/yumrepos/azure-cli enabled=1 gpgcheck=1 gpgkey=https://packages.microsoft.com/keys/microsoft.asc We can now run: sudo dnf install azure-cli -y Once this package has installed, we need to run: az login And follow the instructions. Let’s now generate a token credential granting us registry access by doing the following: (Remember to replace “rhel10demo” with your own registry name.) az acr token create --name mytoken --resource-group linux-vms --registry rhel10demo --scope-map _repositories_push "creationDate": "2025-04-23T14:02:59.654151+00:00", "credentials": { "certificates": null, "passwords": [ { "creationTime": "2025-04-23T14:03:10.774839+00:00", "expiry": null, "name": "password1", "value": "roF8leBAMlfeTS2Jb75WmaK/Z0WATcRK/sNowu6Str+ACRAZruef" }, { "creationTime": "2025-04-23T14:03:10.774858+00:00", "expiry": null, "name": "password2", "value": "vzRJsZwsp4PX/ToXTmAFQitz2yYye+dqHZXbbUXZrD+ACRBPzPXi" } ], "username": "mytoken" }, … In the output, we will see two passwords. Let’s grab password1, which in our example looks like: (You’ll want to save this information as it’s the only time you’ll get to see it!) "passwords": [ { "creationTime": "2025-04-23T14:03:10.774839+00:00", "expiry": null, "name": "password1", "value": "roF8leBAMlfeTS2Jb75WmaK/Z0WATcRK/sNowu6Str+ACRAZruef" }, Here the password is the value of the “value” parameter, and we can now log into the registry with the following command: (Remember to replace “rhel10demo” with your registry name.) podman login rhel10demo.azurecr.io -u mytoken -p roF8leBAMlfeTS2Jb75WmaK/Z0WATcRK/sNowu6Str+ACRAZruef Next, we want to clone the git repository that has our MSSQL Image Mode example: git clone https://github.com/mrguitar/rhel-mssql-bootc As we are using Azure PAYG (Pay as you Go) images, we need to enable RHUI access inside of containers by running: sudo sh -c "echo -e '/usr/share/rhel/secrets:/run/secrets\n/etc/pki/rhui:/etc/pki/rhui\n/etc/yum.repos.d/rh-cloud-base.repo:/etc/yum.repos.d/rh-cloud-base.repo' >> /etc/containers/mounts.conf" sudo chmod 644 /etc/pki/rhui/{private,product}/* Now, let’s change into the directory of our git repository and copy our auth.json for our container: cd rhel-mssql-bootc mkdir -p etc/ostree cp /run/user/1000/containers/auth.json ./etc/ostree/ We can now examine the Containerfile.azure and see that it has instructions to build and bring up a system. In the from container, we already have MIcrosoft SQL Server installed, so we do not have to add that to our container: FROM quay.io/mrguitar/rhel-mssql-bootc:latest COPY etc/ /etc/ COPY 05-cloud-kargs.toml /usr/lib/bootc/kargs.d/ ARG sshpubkey RUN if test -z "$sshpubkey"; then echo "must provide sshpubkey"; exit 1; fi; \ useradd -G wheel core && \ mkdir -m 0700 -p /home/core/.ssh && \ echo $sshpubkey > /home/core/.ssh/authorized_keys && \ chmod 0600 /home/core/.ssh/authorized_keys && \ chown -R core: /home/core # install required packages and enable services RUN dnf -y install \ WALinuxAgent \ cloud-init \ cloud-utils-growpart \ gdisk \ hyperv-daemons && \ dnf clean all && \ systemctl enable NetworkManager.service && \ systemctl enable waagent.service && \ systemctl enable cloud-init.service && \ echo 'ClientAliveInterval 180' >> /etc/ssh/sshd_config # configure waagent for cloud-init to handle provisioning RUN sed -i 's/Provisioning.Agent=auto/Provisioning.Agent=cloud-init/g' /etc/waagent.conf && \ sed -i 's/ResourceDisk.Format=y/ResourceDisk.Format=n/g' /etc/waagent.conf && \ sed -i 's/ResourceDisk.EnableSwap=y/ResourceDisk.EnableSwap=n/g' /etc/waagent.conf We can now build the container image by running: podman build --build-arg "sshpubkey=$(cat ~/.ssh/linuxkey.pub)" --no-cache -f Containerfile.azure Now that we have our container registry configured and our image built, we can push the image to the registry. Let’s start by finding the image ID for our newly built image: [core@container-builder rhel-mssql-bootc]$ podman images REPOSITORY TAG IMAGE ID CREATED SIZE <none> <none> ba4f948305db 2 minutes ago 3.64 GB quay.io/mrguitar/rhel-mssql-bootc latest 534ac925516a 4 weeks ago 3.47 GB We can now tag this image with the following command (Remember to replace “rhel10demo” with your registry name.): podman tag ba4f948305db rhel10demo.azurecr.io/mssql-image-mode:azure Now we can push our image to our registry: podman push rhel10demo.azurecr.io/mssql-image-mode:azure Now, let’s exit from the container building server and go create a new RHEL 9.5 VM that we will switch from package mode to image mode. This needs to be run on the system that you created the container building image on. az vm create \ --resource-group linux-vms \ --name mssql \ --image RedHat:rhel-raw:9_5:9.5.2024120516 \ --admin-username core \ --assign-identity \ --ssh-key-values ~/.ssh/linuxkey.pub \ --public-ip-sku Standard Once this machine starts, let’s ssh to the system: ssh -i ~/.ssh/linuxkey core@ip.ip.ip.ip Let’s start by installing podman: sudo dnf install podman -y Now we must login to our container registry using sudo so that podman running as root can download our container image: (We will use the credentials that were generated earlier in this quick start.) sudo podman login rhel10demo.azurecr.io -u mytoken -p roF8leBAMlfeTS2Jb75WmaK/Z0WATcRK/sNowu6Str+ACRAZruef and then we can run the following command to pull down our image mode image and deploy it to this system: sudo podman run --privileged --pid=host -v /var/lib/containers:/var/lib/containers -v /:/target -v /home/core/.ssh/authorized_keys:/bootc_authorized_ssh_keys/root rhel10demo.azurecr.io/mssql-image-mode:azure bootc install to-existing-root --acknowledge-destructive --root-ssh-authorized-keys /bootc_authorized_ssh_keys/root At this point, when the above command completes successfully, you can run: sudo systemctl reboot And when you reboot, your RHEL system will now be running in Image Mode! Let’s go ahead and log back in and see what’s different: ssh -i ~/.ssh/linuxkey core@ip.ip.ip.ip Let’s start by trying to create a file in /opt as root – something you could normally do on package mode RHEL: sudo touch /opt/hello.txt If you are on an image mode system, you’ll see this output: touch: cannot touch 'hello.txt': Read-only file system This is because the files in the container image are immutable and the only way to change them is to change the image and reboot into the image! This provides a significant new layer of security in that operating system files cannot be easily modified. You are still able to edit files in /etc and in home directories. This image has Microsoft SQL Server installed and we can verify this by running: sudo systemctl status mssql-server You should see output similar to: ○ mssql-server.service - Microsoft SQL Server Database Engine Loaded: loaded (/usr/lib/systemd/system/mssql-server.service; disabled; preset: disabled) Active: inactive (dead) Docs: https://docs.microsoft.com/en-us/sql/linux Given that we want all virtual machines based on this image to have SQL Server running by default, let’s make that change to the container image. To do this, we’ll need to exit this shell and ssh into the container builder machine from earlier. On that machine, run: cd rhel-mssql-bootc And then use this command to add “RUN systemctl enable mssql-server.service” to the end of Containerfile.azure: echo >> Containerfile.azure && echo "RUN systemctl enable mssql-server.service" >> Containerfile.azure Once this is done, we need to rebuild the image, tag the rebuilt image, and push the new image to our container registry: (Remember to replace “rhel10demo” with your registry name.) podman build --build-arg "sshpubkey=$(cat ~/.ssh/linuxkey.pub)" --no-cache -f Containerfile.azure podman images REPOSITORY TAG IMAGE ID CREATED SIZE <none> <none> b7322184d64e 9 seconds ago 3.67 GB rhel10demo.azurecr.io/mssql-image-mode azure 237947455bd9 53 minutes ago 3.67 GB quay.io/mrguitar/rhel-mssql-bootc latest e36c92d89714 4 days ago 3.5 GB podman tag b7322184d64e rhel10demo.azurecr.io/mssql-image-mode:azure cp etc/ostree/auth.json /run/user/1000/containers podman push rhel10demo.azurecr.io/mssql-image-mode:azure Once this is done, we can exit this machine and log back on to our image mode machine and run the following: sudo bootc upgrade sudo systemctl reboot And watch the magic as the box reboots! Now when you ssh in again, run: sudo systemctl status mssql-server This time, you should get output showing that SQL Server is running. Upon updating and rebooting all VMs based on this image, they’ll now be running SQL Server. At this point, we can run the SQL Server demo by running: sudo PATH=$PATH:/opt/mssql/bin/ /opt/mssql_demo.sh This demo shows that the SQL Server is working on our RHEL machine running in Image Mode. Any further changes that you want to make to this box can be pushed by changing the container image in the registry and calling a bootc upgrade. Image Mode RHEL also ships with a timer that allows these boxes to check for updates on the weekend. This can be configured so that not all of your image mode RHEL boxes update at the same time. With Image Mode RHEL, you don’t have to worry about system drift as you are always running off of a known image. Also, you’ll probably find yourself saving a lot of time by not having to log on to lots of systems. If you need to spin up 1,000 VMs with the same image, you can easily do that with the tooling we’ve shown you today on top of Azure!
abbottkarl
May 19, 2025 Place Linux and Open Source Blog
768Views
1like
0Comments
Azure Image Testing for Linux (AITL)
As cloud and AI evolve at an unprecedented pace, the need to deliver high-quality, secure, and reliable Linux VM images has never been more essential. Azure Image Testing for Linux (AITL) is a self-service validation tool designed to help developers, ISVs, and Linux distribution partners ensure their images meet Azure’s standards before deployment. With AITL, partners can streamline testing, reduce engineering overhead, and ensure compliance with Azure’s best practices, all in a scalable and automated manner. Let’s explore how AITL is redefining image validation and why it’s proving to be a valuable asset for both developers and enterprises. Before AITL, image validation was largely a manual and repetitive process, engineers were often required to perform frequent checks, resulting in several key challenges: Time-Consuming: Manual validation processes delayed image releases. Inconsistent Validation: Each distro had different methods for testing, leading to varying quality levels. Limited Scalability: Resource constraints restricted the ability to validate a broad set of images. AITL addresses these challenges by enabling partners to seamlessly integrate image validation into their existing pipelines through APIs. By executing tests within their own Azure subscriptions prior to publishing, partners can ensure that only fully validated, high-quality Linux images are promoted to production in the Azure environment. How AITL Works? AITL is powered by LISA, which is a test framework and a comprehensive opensource tool contains 400+ test cases. AITL provides a simple, yet powerful workflow run LISA test cases: Registration: Partners register their images in AITL’s validation framework. Automated Testing: AITL runs a suite of predefined validation tests using LISA. Detailed Reporting: Developers receive comprehensive results highlighting compliance, performance, and security areas. All test logs are available to access. Self-Service Fixes: Any detected issues can be addressed by the partner before submission, eliminating delays and back-and-forth communication. Final Sign-Off: Once tests pass, partners can confidently publish their images, knowing they meet Azure’s quality standards. Benefits of AITL AITL is a transformative tool that delivers significant benefits across the Linux and cloud ecosystem: Self-Service Capability: Enables developers and ISVs to independently validate their images without requiring direct support from Microsoft. Scalable by Design: Supports concurrent testing of multiple images, driving greater operational efficiency. Consistent and Standardized Testing: Offers a unified validation framework to ensure quality and consistency across all endorsed Linux distributions. Proactive Issue Detection: Identifies potential issues early in the development cycle, helping prevent costly post-deployment fixes. Seamless Pipeline Integration: Easily integrates with existing CI/CD workflows to enable fully automated image validation. Use Cases for AITL AITL designed to support a diverse set of users across the Linux ecosystem: Linux Distribution Partners: Organizations such as Canonical, Red Hat, and SUSE can validate their images prior to publishing on the Azure Marketplace, ensuring they meet Azure’s quality and compliance standards. Independent Software Vendors (ISVs): Companies providing custom Linux Images can verify that their custom Linux-based solutions are optimized for performance and reliability on Azure. Enterprise IT Teams: Businesses managing their own Linux images on Azure can use AITL to validate updates proactively, reducing risk and ensuring smooth production deployments. Current Status and Future Roadmap AITL is currently in private preview, with five major Linux distros and select ISVs actively integrating it into their validation workflows. Microsoft plans to expand AITL’s capabilities by adding: Support for Private Test Cases: Allowing partners to run custom tests within AITL securely. Kernel CI Integration: Enhancing low-level kernel validation for more robust testing and results for community. DPDK and Specialized Validation: Ensuring network and hardware performance for specialized SKU (CVM, HPC) and workloads How to Get Started? For developers and partners interested in AITL, following the steps to onboard. Register for Private Preview AITL is currently hidden behind a preview feature flag. You must first register the AITL preview feature with your subscription so that you can then access the AITL Resource Provider (RP). These are one-time steps done for each subscription. Run the “az feature register” command to register the feature: az feature register --namespace Microsoft.AzureImageTestingForLinux --name JobandJobTemplateCrud Sign Up for Private Preview – Contact Microsoft’s Linux Systems Group to request access. Private Preview Sign Up To confirm that your subscription is registered, run the above command and check that properties.state = “Registered” Register the Resource Provider Once the feature registration has been approved, the AITL Resource Provider can be registered by running the “az provider register” command: az provider register --namespace Microsoft.AzureImageTestingForLinux *If your subscription is not registered to Microsoft.Compute/Network/Storage, please do so. These are also prerequisites to using the service. This can be done for each namespace (Microsoft.Compute, Microsoft.Network, Microsoft.Storage) through this command: az provider register --namespace Microsoft.Compute Setup Permissions The AITL RP requires a permission set to create test resources, such as the VM and storage account. The permissions are provided through a custom role that is assigned to the AITL Service Principal named AzureImageTestingForLinux. We provide a script setup_aitl.py to make it simple. It will create a role and grant to the service principal. Make sure the active subscription is expected and download the script to run in a python environment. https://raw.githubusercontent.com/microsoft/lisa/main/microsoft/utils/setup_aitl.py You can run the below command: python setup_aitl.py -s "/subscriptions/xxxx" Before running this script, you should check if you have the permission to create role definition in your subscription. *Note, it may take up to 20 minutes for the permission to be propagated. Assign an AITL jobs access role If you want to use a service principle or registration application to call AITL APIs. The service principle or App should be assigned a role to access AITL jobs. This role should include the following permissions: az role definition create --role-definition '{ "Name": "AITL Jobs Access Role", "Description": "Delegation role is to read and write AITL jobs and job templates", "Actions": [ "Microsoft.AzureImageTestingForLinux/jobTemplates/read", "Microsoft.AzureImageTestingForLinux/jobTemplates/write", "Microsoft.AzureImageTestingForLinux/jobTemplates/delete", "Microsoft.AzureImageTestingForLinux/jobs/read", "Microsoft.AzureImageTestingForLinux/jobs/write", "Microsoft.AzureImageTestingForLinux/jobs/delete", "Microsoft.AzureImageTestingForLinux/operations/read", "Microsoft.Resources/subscriptions/read", "Microsoft.Resources/subscriptions/operationresults/read", "Microsoft.Resources/subscriptions/resourcegroups/write", "Microsoft.Resources/subscriptions/resourcegroups/read", "Microsoft.Resources/subscriptions/resourcegroups/delete" ], "IsCustom": true, "AssignableScopes": [ "/subscriptions/01d22e3d-ec1d-41a4-930a-f40cd90eaeb2" ] }' You can create a custom role using the above command in the cloud shell, and assign this role to the service principle or the App. All set! Please go through a quick start to try AITL APIs. Download AITL wrapper AITL is served by Azure management API. You can use any REST API tool to access it. We provide a Python wrapper for better experience. The AITL wrapper is composed of a python script and input files. It calls “az login” and “az rest” to provide similar experience like the az CLI. The input files are used for creating test jobs. Make sure az CLI and python 3 are installed. Clone LISA code, or only download files in the folder. lisa/microsoft/utils/aitl at main · microsoft/lisa (github.com). Use the command below to check the help text. python -m aitl job –-help python -m aitl job create --help Create a job Job creation consists of two entities: A job template and an image. The quickest way to get started with the AITL service is to create a Job instance with your job template properties in the request body. Replace placeholders with the real subscription id, resource group, job name to start a test job. This example runs 1 test case with a marketplace image using the tier0.json template. You can create a new json file to customize the test job. The name is optional. If it’s not provided, AITL wrapper will generate one. python -m aitl job create -s {subscription_id} -r {resource_group} -n {job_name} -b ‘@./tier0.json’ The default request body is: { "location": "westus3", "properties": { "jobTemplateInstance": { "selections": [ { "casePriority": [ 0 ] } ] } } } This example runs the P0 test cases with the default image. You can choose to add fields to the request, such as image to test. All possible fields are described in the API Specification – Jobs section. The “location” property is a required field that represents the location where the test job should be created, it doesn’t affect the location of VMs. AITL supports “westus”, “westus2”, or “westus3”. The image object in the request body json is where the image type to be used for testing is detailed, as well as the CPU architecture and VHD Generation. If the image object is not included, LISA will pick a Linux marketplace image that meets the requirements for running the specified tests. When an image type is specified, additional information will be required based on the image type. Supported image types are VHD, Azure Marketplace image, and Shared Image Gallery. - VHD requires the SAS URL. - Marketplace image requires the publisher, offer, SKU, and version. - Shared Image Gallery requires the gallery name, image definition, and version. Example of how to include the image object for shared image gallery. (<> denotes placeholder): { "location": "westus3", “properties: { <...other properties from default request body here>, "image": { "type": "shared_gallery", "architecture": "x64", "vhdGeneration": 2, "gallery": "<Example: myAzureComputeGallery>", "definition": "<Example: myImage1>", "version": "<Example: 1.0.1>" } } } Check Job Status & Test Results A job is an asynchronous operation that is updated throughout the job’s lifecycle with its operation and ongoing tests status. A job has 6 provisioning states – 4 are non-terminal states and 2 are terminal states. Non-terminal states represent ongoing operation stages and terminal states represent the status at completion. The job’s current state is reflected in the `properties.provisioningState` property located in the response body. The states are described below: Operation States State Type Description Accepted Non-Terminal state Initial ARM state describing the resource creation is being initialized. Queued Non-Terminal state The job has been queued by AITL to run LISA using the provided job template parameters. Scheduling Non-Terminal state The job has been taken off the queue and AITL is preparing to launch LISA. Provisioning Non-Terminal state LISA is creating your VM within your subscription using the default or provided image. Running Non-Terminal state LISA is running the specified tests on your image and VM configuration. Succeeded Terminal state LISA completed the job run and has uploaded the final test results to the job. There may be failed test cases. Failed Terminal state There was a failure during the job’s execution. Test results may be present and reflect the latest status for each listed test. Test results are updated in near real-time and can be seen in the ‘properties.results’ property in the response body. Results will begin to get updated during the “Running” state and the final set of result updates will happen prior to reaching a terminal state (“Completed” or “Failed”). For a complete list of possible test result properties, go to the API Specification – Test Results section. Run below command to get detailed test results. python -m aitl job get -s {subscription_id} -r {resource_group} -n {job_name} The query argument can format or filter results by JMESquery. Please refer to help text for more information. For example, List test results and error messages. python -m aitl job get -s {subscription_id} -r {resource_group} -n {job_name} -o table -q 'properties.results[].{name:testName,status:status,message:message}' Summarize test results. python -m aitl job get -s {subscription_id} -r {resource_group} -n {job_name} -q 'properties.results[].status|{TOTAL:length(@),PASSED:length([?@==`"PASSED"`]),FAILED:length([?@==`"FAILED"`]),SKIPPED:length([?@==`"SKIPPED"`]),ATTEMPTED:length([?@==`"ATTEMPTED"`]),RUNNING:length([?@==`"RUNNING"`]),ASSIGNED:length([?@==`"ASSIGNED"`]),QUEUED:length([?@==`"QUEUED"`])}' Access Job Logs To access logs and read from Azure Storage, the AITL user must have “Storage Blob Data Owner” role. You should check if you have the permission to create role definition in your subscription, likely with your administrator. For information on this role and instructions on how to add this permission, see this Azure documentation. To access job logs, send a GET request with the job name and use the logUrl in the response body to retrieve the logs, which are stored in Azure storage container. For more details on interpreting logs, refer to the LISA documentation on troubleshooting test failures. To quickly view logs online (note that file size limitations may apply), select a .log Blob file and click "edit" in the top toolbar of the Blob menu. To download the log, click the download button in the toolbar. Conclusion AITL represents a forward-looking approach to Linux image validation bringing automation, scalability, and consistency to the forefront. By shifting validation earlier in the development cycle, AITL helps reduce risk, accelerate time to market, and ensure a reliable, high-quality Linux experience on Azure. Whether you're a developer, a Linux distribution partner, or an enterprise managing Linux workloads on Azure, AITL offers a powerful way to modernize and streamline your validation workflows. To learn more or get started with AITL or more details and access to AITL, reach out to Microsoft Linux Systems Group
KashanK
Apr 10, 2025 Place Linux and Open Source Blog
1KViews
0likes
0Comments
Linux and Open Source on Azure Quarterly Update - February 2025
As we venture into 2025, it's exhilarating to reflect on the astonishing strides we've made in the domain of Linux and Open Source Software (OSS) on Azure. Let us dive into another edition of the quarterly update to learn more!   Microsoft Ignite 2024 Linux on Azure took center stage at Microsoft Ignite 2024 with dedicated session and a meet-up booth. Our breakout session, theater session, and lab session drew over 500 attendees. This engagement is a testament to the enthusiasm and interest in Linux-based solutions on Azure. Check out the on-demand recording available on the Ignite website: What’s new in Linux: How we’re collaborating to help shape its future We announced that the Azure security baseline through Azure Policy and Machine Configuration for Linux has moved to public preview, and we are expanding the capabilities with built-in auto-remediation feature (limited public preview). Red Hat on Azure announcements at Ignite are captured here. Linux Promotional Offer The promotional offer for the latest Linux VMs in Azure is currently live. For a limited time, you can save an additional 15% on one-year Azure Reserved Virtual Machine (VM) Instances for the latest Linux VMs. This means you could save up to 56% compared to running an Azure VM on a PAYG (pay-as-you-go) basis. This offer is available until March 31, 2025. To learn more, read the blog and refer to the terms and conditions. Azure Linux 3.0 in preview on Azure Kubernetes Service v1.31 We are excited to announce that Azure Linux 3.0, the next major version release of the Azure Linux container host for Azure Kubernetes Service (AKS), is now available in preview on AKS version 1.31. Azure Linux 3.0 offers increased package availability and versions, an updated kernel, and improvements to performance, security, and tooling and developer experience. SUSE LTSS on Azure Marketplace Many of our customers rely on SUSE Linux Enterprise Server (SLES) for running their mission-critical SAP and HPC (high-performance computing) workloads on Azure. We’re excited to share that SUSE Long Term Service Pack Support (LTSS) is available in the Azure Marketplace, providing customers with options for managing the support lifecycle of their SUSE images in Azure. The blog announcement is here. Linux VM Image Quality on Azure In the continuously evolving landscape of cloud computing and AI, the quality and reliability of virtual machines (VMs) plays a vital role for businesses running mission-critical workloads. With over 65% of Azure workloads running Linux, our commitment to delivering high-quality Linux VM images and platforms remains unwavering. Find out how Microsoft ensures the quality of Linux VM images and platform experiences on Azure. Learn how LISA (an open-source tool) enhances the testing and validation processes for Linux kernels and guest OS images on Azure. MIT Technology Review Article We recently commissioned a sponsored article in collaboration with AMD on the topic of “Accelerating AI innovation through application modernization” published on MIT Technology Review. The article delves into AI driving new requirements for application modernization. Red Hat Summit Connects Microsoft’s sponsorship of the Red Hat Summit Connect global event series proved to be a resounding success. Spanning cities from Melbourne to Mexico City, we engaged with over 6,500 attendees. By partnering with key organizations, we reinforced the strength of our strategic alliance with Red Hat. What’s coming up next  Migrate to Innovate Summit This event aims to showcase how cloud migration and modernization can build a platform for AI innovation. In 2.5 hours, the event will feature thought leaders and experts from Microsoft and Intel who will share their perspectives, present real-world case studies, and showcase product demonstrations to help customers accelerate their cloud journey. The event will be live on March 11, 2025. Register to check out the great content! SUSECON 2025 We will be at SUSECON 2025, which will take place in Orlando, Florida, from March 10th – 14th, 2025. We look forward to sharing insights, learning, and collaborating with everyone attending. Discover why Microsoft Azure is a trusted and proven cloud platform and explore the benefits of Azure-optimized solutions co-developed by Microsoft and SUSE for your business-critical Linux workloads. Check out one of the Microsoft sessions and meet with us at our booth. We recently published a recap covering some of Microsoft partners’ latest offerings on Linux and PGSQL. Stay tuned for more updates and thank you for being a part of this journey!
GarimaSingh
Feb 18, 2025 Place Linux and Open Source Blog
715Views
0likes
0Comments